Abstract — Virtual machines are among the most widely used
virtualization technologies today, both for everyday work that
saves hardware resources and as a tool for research on
malware, network installations, and so on. The wide use of
virtualization technology poses a new challenge for digital
forensics experts, who must study how to recover evidence
from deleted virtual machine images. This research examines
whether evidence of activity remains after a virtual machine
is destroyed and how to locate potential digital evidence
using the Virtual Machine Forensic Analysis and Recovery
method. The results show that a virtual machine removed from
the VirtualBox library could be recovered and analyzed using
Autopsy and FTK with the proposed analytical method. Four
deleted files in the VMDK file could also be recovered and
analyzed as digital evidence, with hashes and metadata that
matched the originals. However, virtual machine images with
Windows-based and Linux-based operating systems that were
deleted using the Destroy method in VirtualBox could not be
recovered with Autopsy or FTK. VirtualBox log analysis,
deleted filesystem analysis, and registry analysis all failed
to recover backbox.vmdk and windows 7.vmdk, because the
deletion used a high-level removal method that closely
resembles wiping data from a hard drive.

Keywords: Virtual, Forensics, Recovery, Cybercrime,
Anti-Forensics

I. INTRODUCTION

Digital forensics is the process of identifying, obtaining,
analyzing, and presenting evidence to the court to resolve a
criminal case while observing and maintaining the integrity
and authenticity of the evidence [1]. The application of
digital forensics to a virtual machine is generally called
virtual machine forensics. It cannot be separated from the
existence of various techniques for removing evidence,
commonly called anti-forensics. Among such anti-forensic
techniques, removing the VM and restoring it to the system's
initial snapshot are categorized as artifact wiping and trace
removal. Creating and operating a virtual machine is very
easy, and many tutorials for these processes are available on
the internet. A virtual machine can be run in portable mode by
installing it from a USB drive, with snapshots as the easiest
removal technique. The motivation for using anti-forensics is
to minimize or inhibit the discovery of digital evidence in
criminal cases [2]. Even after a virtual machine has been
destroyed by an attacker, it may still be possible to find and
restore files and evidence.

Forensic investigations of virtual machines pose a
challenge to investigators because these systems differ from
ordinary physical computers. The attention given to this
challenge has not kept pace with the ease of use and rapid
development of the technology. Most of the literature
discusses file recovery, performance optimization, and
security enhancements; only a few works deal with virtual
machine forensics. The Central Forensic Laboratory of Police
Headquarters handled around 50 computer crime and
computer-related crime cases, with a total of about 150 units
of electronic evidence, over a period of time [3].

One study is of particular interest to the author: a
paper that analyzed a virtual machine snapshot on a .vdi
image and reported that files that had been deleted could not
be recovered [4]. In general, that research did not use a
clear method and did not give a clear reason why the files
could not be recovered. Starting from that problem, the author
conducts more in-depth research, especially on destroyed
virtual machines, and proposes a methodology as a basic
reference for forensic analysis and recovery. The domain of
this research is digging up information from, and conducting
recovery on, deleted and destroyed virtual machines. The
method used to obtain the data is Virtual Machine Forensics
Analysis & Recovery, using both static forensic acquisition
and live forensic acquisition.

II. LITERATURE REVIEW 

Previous research duplicated a server into two or more
virtual machine servers, in which each virtual machine image
resulting from the duplication was run in a different VM. The
choice of VMs depended on computing power, availability, and
cost. The authors presented a new optimization model to
determine the number and type of VMs required for each
server that would minimize costs while ensuring the
availability set by the SLA (Service Level Agreement). They
also showed that running duplicates on several different VMs
could be more cost-effective for running multiple servers
than limiting the server copy to a single VM [5].

Maintaining the integrity of the original evidence is
essential for the forensic examination process, since changing
even one bit among gigabits alters the data, cannot be undone,
and casts doubt on the evidence being extracted. In
traditional forensics, write-blockers are used to maintain the
integrity of the evidence and prevent the OS from altering it,
but virtual machine forensics presents a more difficult
challenge. Access to the physical storage is rarely possible;
usually the only storage that can be accessed is a virtual
hard drive. It has the same integrity issues as a real device,
with additional complications: in this case, it is not
possible to use hardware-based write-blockers to prevent
changes to the data. Tobin & Kechadi (2016) presented an
implementation of their own write-blocker software and
demonstrated how to use it in conformance with the ACPO
principles for digital evidence [6].
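
The sensitivity of evidence to a single changed bit can be illustrated with a short sketch. It assumes MD5 as the fingerprint, since that is the hash reported later in this paper; the byte string is a placeholder and not real evidence, and the helper names are hypothetical.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def md5_hex(data: bytes) -> str:
    """MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Placeholder content standing in for an evidence image (assumption).
original = b"evidence block" * 1024
altered = flip_bit(original, 12345)

print("original:", md5_hex(original))
print("altered: ", md5_hex(altered))
print("same length:", len(original) == len(altered))
print("same hash:  ", md5_hex(original) == md5_hex(altered))
```

The two inputs differ in one bit and have the same length, yet their digests differ, which is why a recorded hash is enough to detect any later change to an acquired image.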

III. BASIC THEORY 

A. Static Forensics 

Static forensics is the most widely used acquisition
method today. It extracts, analyzes, and obtains electronic
evidence after the incident has occurred. Static forensics
technology is well developed, especially in digital evidence
extraction, analysis, assessment, submission, and conformity
with applicable legal procedures. Many techniques are
available in line with current technological developments,
such as copying disk images and searching and filtering
information, all of which play an important role in digital
forensic processes. Static forensics tools have been developed
by various IT security companies around the world, such as
text search tools, drive imaging programs, forensic toolkits,
The Coroner's Toolkit, ForensiX, NTI, EnCase, Autopsy, etc.
These tools have been accepted by forensic experts as playing
an important role in the digital forensics process [7].

Static analysis methods are often more effective for
recovering data from storage. Their advantages include:
accessing and identifying the file system; recovering deleted
files that have not been overwritten by other files;
determining file types; searching files by keywords,
appropriate patterns, or MAC (Modify, Access, Creation)
times; and carving relevant data from a larger portion of raw
data. Static analysis forms the basis of most digital evidence
recovery processes and is widely used by legal practitioners
[8].
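
Carving, mentioned above, can be sketched as a scan of raw bytes for known file signatures. The example below is a minimal illustration that assumes JPEG start and end markers only; the function name and the size limit are hypothetical, and production carvers handle fragmentation and false positives that this sketch ignores.

```python
from typing import List

JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus first marker byte
JPEG_EOI = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(raw: bytes, max_size: int = 10 * 1024 * 1024) -> List[bytes]:
    """Return byte ranges that start with a JPEG header and end at the
    next end-of-image marker, ignoring the file system entirely."""
    carved = []
    pos = 0
    while True:
        start = raw.find(JPEG_SOI, pos)
        if start == -1:
            break
        end = raw.find(JPEG_EOI, start + len(JPEG_SOI))
        if end == -1:
            break
        end += len(JPEG_EOI)
        if end - start <= max_size:
            carved.append(raw[start:end])
            pos = end
        else:
            # Candidate too large to be plausible; retry after this header.
            pos = start + 1
    return carved


# Synthetic "disk" with two fake JPEG bodies between filler bytes.
fake_a = JPEG_SOI + b"\xe0first-payload" + JPEG_EOI
fake_b = JPEG_SOI + b"\xe1second-payload" + JPEG_EOI
raw_disk = b"\x00" * 16 + fake_a + b"\x11" * 16 + fake_b + b"\x22" * 8

for index, blob in enumerate(carve_jpegs(raw_disk), start=1):
    print(index, len(blob), "bytes")
```

Because the scan never consults a file table, it can find content whose directory entry is gone, as long as the bytes themselves have not been overwritten.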

Static acquisition is performed on electronic evidence
confiscated by officers at the scene of a crime or submitted
by the suspect [9]. Investigators generally prefer this method
for collecting digital evidence because the acquisition
process does not change the existing data on the electronic
evidence [10]. Before the acquisition is performed on the
analysis computer, a write blocker is turned on to prevent any
change to the data, and therefore to its hash, when the drive
is connected to the analysis computer.

Static acquisition becomes difficult in situations where
the drive or data set is encrypted and can be read only while
the computer is switched on and logged in with the owner's
username and password, or where the investigator can access
the computer only over a remote network. The right solution
for such cases is the live acquisition method of collecting
digital evidence [11].

B. Live Forensics 

Live forensics and traditional forensic techniques share
the same stages: identification, storage, analysis, and
presentation. Live forensics was developed to address a
shortcoming of traditional techniques, which cannot capture
data and information that are available only while the system
is running, such as memory, network processes, the swap file,
running system processes, and information from system files
[12]. The principle is to preserve digital evidence in the
form of processes and other computer activity while the
computer is on and connected to a network, because such
evidence on a running computer is lost once the computer has
been shut down [13].

IV. RESEARCH METHODS 

In this paper, the author proposes a methodology for
conducting acquisition and analysis. It is expected to yield
information on the digital evidence relevant to the case. The
method can be seen in Figure 1 below.



Figure 1. The Proposed Methodology 


In this study, the author also proposes an analysis method
that differs from previous research, because each analytical
step has its own part and procedure. The analysis is done
under two conditions: first, analyzing the intact and
undeleted files of the VirtualBox system; and second,
analyzing the OS/application, registry, metadata, and other
logs to look for evidence and for the reason "why can the
virtual machine that has been destroyed not be restored?". The
analysis scheme proposed in this research can be seen in
Figure 2 below:

Figure 2. Analysis Flow

To test the methodology and to support the research, the
hardware and software listed in Table 1 were required.


Table 1. Hardware and Software

Hardware                  Software
------------------------  ---------------------
Laptop Core i7 8GB RAM    VirtualBox
HDD Toshiba 1 TB          Autopsy 4.5.0
HDD WD 60 GB              Forensic Toolkit 1.71
                          FTK Imager
                          Regshot
                          USB Writeblocker


The case scenarios used to test the proposed method were
deleting files on storage and analyzing the browser history
for traces of hacking. The details can be seen in Table 2
below.


Table 2. Testing Scenario

No  Scenario 1        Scenario 2        Status
--  ----------------  ----------------  -------------------
1   Windows 7.vmdk    -                 Destroy
2   Backbox.vmdk      -                 Destroy
3   -                 Backbox 2.vmdk    Remove From Library
4   -                 Windows7 2.vmdk   Remove From Library
5   Recover deleted files in backbox    Delete
    & windows 7
6   Chrome history analysis on windows 7 (web hacking)


V. ANALYSIS AND RESULT

A. Acquisition and Imaging

The method applied was static acquisition, in which the
acquisition process is performed while the machine or device
is switched off. An important step that should not be missed
before the acquisition process is to install a write blocker
on the device that will be used to make the acquisition. This
prevents operating systems such as Windows from writing data
automatically and contaminating the original data to be
acquired.

In order to guarantee the authenticity of the imaging
result, it is necessary to record information from the
acquisition process: the start and end of the acquisition
process, the hash values, and the size of the image file.
After verification, the MD5 hash
62d2cc945331bb43eb3b34e75df72430 and the SHA1 hash
7721ebd99bd66d90b0fc7a0ce6d95e3da317c420 matched.

B. Recover VM Image

In this research, Forensic Toolkit and Autopsy were
applied to recover the virtual machine image files that had
been deleted, along with the important files inside them.
These tools can also open an image file from the virtual
machine, which is then extracted and stored on the computer
for analysis. The virtual machine disk image can be exported
with them and then added to an open case or a new case as a
new device, as a single .vdi or .vmdk file, as if it were a
complete hard drive containing the system information.
Recovering the deleted VM and extracting all the contents of
its hard disk file makes it easier for investigators to find
and access all the intact or deleted data on it.

After the acquisition process, recovery could be carried
out directly for the second scenario, Remove From Library, on
two files: Backbox 2.vmdk and Windows7 2.vmdk. The Backbox
2.vmdk file was extracted and its contents analyzed with
Autopsy, then hashed to compare the original hash value with
that of the extracted result. The resulting MD5 hash,
3eb4d201f6c40e6df74cf4b3b125b8af, matched the original MD5
hash. The Windows7 2.vmdk file, after extraction, had the MD5
hash 80a691fbc5352c0786b8a7e92ed3a7c3, which also matched the
original MD5 hash.

C. Extract Filesystem Metadata

a) Deleted Filesystem Extraction 

At this stage, the filesystem metadata on VirtualBox was
extracted to search for facts, evidence, and digital traces of
the erased VMs Backbox.vmdk and windows 7.vmdk. After
analyzing the deleted files using Autopsy and FTK, we obtained
a file belonging to the VirtualBox filesystem plugin,
vboxinfo.py, which contained only standard information about
the file. According to that information, the file was created
on 10-10-2017 at 19:15:06 WIB, which is also when it was last
accessed and modified, so this file yielded no more specific
information about the digital evidence being sought.

b) VirtualBox Log Extraction

Attacks and other activities on a computer network are
generally recorded in a log file with a specific data format
[14]. VBoxSVC.log is a VirtualBox log file. It was found as
metadata file ID 955
(S-1-5-21-1274466777-218395936-3752514298-1001), created on
19-10-2017 at 16:25:04 and last accessed and modified at
16:28:30 on the same date.

[...]695-2d8d-4256-9c7c-cce4184fa048} aComponent={Machine}
  aText={Machine is [...] locked for session (session state: Unlocked)}, preserve=false
00:09:17.774074 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001)
  aIID={3295e6ce-b051-47b2-9514-2c588bfe7554} aComponent={ExtPackManager}
  aText={No extension pack by the name 'Oracle VM VirtualBox Extension Pack' was found}, preserve=false
00:09:17.784655 ERROR [COM]: aRC=VBOX_E_IPRT_ERROR (0x80bb0005)
  aIID={480cf695-2d8d-4256-9c7c-cce4184fa048} aComponent={SessionMachine}
  aText={Saved screenshot data is not available (VERR_NOT_SUPPORTED)}, preserve=false
00:09:17.949290 ERROR [COM]: aRC=E_FAIL (0x80004[...]
  [...]} aComponent={Medium} aText={UUID {dc782d1b[...]
  [...]\Users\tesis\VirtualBox VMs\Backbox\Backbox[...]
00:21:26.196252 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001)
  aIID={fafa4e17-1ee2-4905-a10e-fe7c18bf5554} aComponent={VirtualBox}
  aText={Could not find a registered machine named 'Backbox'}, preserve=false
00:21:32.349560 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001)
  aIID={fafa4e17-1ee2-4905-a10e-fe7c18bf5554} aComponent={VirtualBox}
  aText={Could not find a registered machine named 'C:/Users/tesis/VirtualBox VMs/Backbox/Backbox.vbox'}, preserve=false
00:21:32.430060 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={fafa4[...]


Figure 3. Log information about Backbox.vmdk 
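
Entries like those in Figure 3 can be filtered programmatically rather than read by eye. The following is a minimal sketch, not part of the original study: the line layout is assumed from the entries visible in Figure 3 (other VirtualBox versions may format them differently), and the function names are hypothetical.

```python
import re

# Layout assumed from Figure 3: timestamp, ERROR [COM], result code,
# interface ID, component, message text, preserve flag.
COM_ERROR = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+ERROR \[COM\]: "
    r"aRC=(?P<rc>\w+) \((?P<code>0x[0-9a-fA-F]+)\) "
    r"aIID=\{(?P<iid>[^}]*)\} "
    r"aComponent=\{(?P<component>[^}]*)\} "
    r"aText=\{(?P<text>.*)\}, preserve=(?P<preserve>\w+)\s*$"
)


def parse_com_errors(lines):
    """Return one dict per line that matches the COM error layout."""
    entries = []
    for line in lines:
        match = COM_ERROR.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def missing_machines(entries):
    """Names or paths of machines reported as no longer registered."""
    prefix = "Could not find a registered machine named "
    return [
        entry["text"][len(prefix):].strip("'")
        for entry in entries
        if entry["rc"] == "VBOX_E_OBJECT_NOT_FOUND"
        and entry["text"].startswith(prefix)
    ]


# Two entries transcribed from Figure 3, plus one line that should be skipped.
sample_log = [
    "00:21:26.196252 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) "
    "aIID={fafa4e17-1ee2-4905-a10e-fe7c18bf5554} aComponent={VirtualBox} "
    "aText={Could not find a registered machine named 'Backbox'}, preserve=false",
    "00:21:32.349560 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) "
    "aIID={fafa4e17-1ee2-4905-a10e-fe7c18bf5554} aComponent={VirtualBox} "
    "aText={Could not find a registered machine named "
    "'C:/Users/tesis/VirtualBox VMs/Backbox/Backbox.vbox'}, preserve=false",
    "00:00:01.000000 a line in some other format",
]

log_entries = parse_com_errors(sample_log)
print(missing_machines(log_entries))
```

Filtering on the result code and message prefix isolates the entries that show VirtualBox looking for a machine that is no longer registered, which is the observation the analysis relies on.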


The logs show that VirtualBox could not find the Backbox
virtual machine located in the Backbox directory, where
Backbox.vmdk is the core file of the Backbox VM that had been
run in VirtualBox. The information shown in Figure 3 above
indicates that the Backbox folder and its files had been
deleted from both the VirtualBox system and the storage
itself when the VM was destroyed through the VirtualBox app.
The registry file VirtualBox.xml, located in the
".VirtualBox" directory, was then searched and analyzed, and
it yielded the information shown in Figure 4 below.


<ExtraDataItem name="GUI/GroupDefinitions/" value="m=7649b3ec-bcf8-47a7-9a4e-00e4025ad383,m=c48c2315-d8e5-4319-993e-4c6730e2ee28"/>
<ExtraDataItem name="GUI/LastItemSelected" value="m=Backbox 2"/>
<ExtraDataItem name="GUI/LastWindowPosition" value="558,98,770,550"/>
<ExtraDataItem name="GUI/SplitterSizes" value="156,609"/>
</ExtraData>
<MachineRegistry>
<MachineEntry uuid="{7649b3ec-bcf8-47a7-9a4e-00e4025ad383}" src="C:\Users\tesis\VirtualBox VMs\Backbox 2\Backbox 2.vbox"/>
<MachineEntry uuid="{c48c2315-d8e5-4319-993e-4c6730e2ee28}" src="C:\Users\tesis\VirtualBox VMs\windows7 2\windows7 2.vbox"/>
</MachineRegistry>
<MediaRegistry>
<HardDisks/>

Figure 4. Log Registry Information VirtualBox 
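
The registry content in Figure 4 can also be read with a standard XML parser. The sketch below is an illustration under stated assumptions: the sample document is a trimmed stand-in whose wrapper elements and namespace are assumed rather than taken from the acquired file, the `HardDisk` attribute names are likewise assumed, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}MachineEntry'."""
    return tag.rsplit("}", 1)[-1]


def registry_entries(xml_text):
    """Return (machines, disks) registered in a VirtualBox.xml document."""
    root = ET.fromstring(xml_text)
    machines, disks = [], []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "MachineEntry":
            machines.append((element.get("uuid"), element.get("src")))
        elif name == "HardDisk":
            disks.append((element.get("uuid"), element.get("location")))
    return machines, disks


# Trimmed stand-in for VirtualBox.xml; only the two MachineEntry lines and
# the empty HardDisks element come from Figure 4.
SAMPLE_XML = r"""<VirtualBox xmlns="http://www.virtualbox.org/">
  <Global>
    <MachineRegistry>
      <MachineEntry uuid="{7649b3ec-bcf8-47a7-9a4e-00e4025ad383}" src="C:\Users\tesis\VirtualBox VMs\Backbox 2\Backbox 2.vbox"/>
      <MachineEntry uuid="{c48c2315-d8e5-4319-993e-4c6730e2ee28}" src="C:\Users\tesis\VirtualBox VMs\windows7 2\windows7 2.vbox"/>
    </MachineRegistry>
    <MediaRegistry>
      <HardDisks/>
    </MediaRegistry>
  </Global>
</VirtualBox>"""

machines, disks = registry_entries(SAMPLE_XML)
for uuid, src in machines:
    print(uuid, src)
print("registered hard disks:", disks)
```

On this sample, only the two cloned machines are listed and no hard disk is registered, which mirrors what Figure 4 shows: the registry holds no entry for the destroyed Backbox or Windows 7 machines.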

Meanwhile, information found in the file
VirtualBox.xml states that the backbox folder and the
backbox.vmdk file are in the Recent Folder and Recent List,
which shows that this folder and file had been accessed by
the user. However, this information contains no trace that
leads to the storage and offers no possibility of restoring
the backbox.vmdk file.

In the same log file we also found information that
VirtualBox could not find the virtual machine located in the
Windows 7 directory, together with a reference to windows
7.vbox, a supporting file of the VM windows 7.vmdk that ran
in VirtualBox. The windows 7.vbox file was generated
automatically when windows 7.vmdk was run as a virtual
machine. This information indicates that the Windows 7
folder, along with the files inside it, had been deleted from
both the VirtualBox system and the storage itself when the VM
was destroyed through the VirtualBox app. Unlike the earlier
backbox.vmdk file, however, no information about the windows
7.vmdk file itself was found in the log.

D. Analyze 

a) Virtual Machine Analysis

Two different tools, FTK (Forensic Toolkit) and
Autopsy, were used to analyze the DD file from the
acquisition of the drive and the virtual machine image, in
order to find digital evidence and dig up more information
about "what evidence can be obtained from the virtual
machine?". The analysis of the virtual machine files focused
on documents (pdf, xls, doc, etc.), images, videos, browser
history, virtual machine images, virtual machine logs, and
other data supporting case resolution according to the
scenario. As a result, a folder named "data rahasia"
(Indonesian for "secret data") was found in a directory of
the Linux operating system, and 4 files that had been deleted
could still be recovered. One of them, the image file Image
1.jpg, can be seen in Figure 5 below.



Figure 5. Recovery of file Image 1.jpg

Analysis of the hex content of Image 1.jpg revealed no
suspicious information and nothing that could serve as a lead
to new information related to the deleted VM. The hex viewer
did show that the image file had been processed using Adobe
Photoshop 3.0.8, while the metadata of Image 1.jpg showed when
the file was created, modified, and last accessed by the user
on that computer, see below.

Accessed:       2017-10-14 10:40:15.096694368 (SE Asia Standard Time)
File Modified:  2017-03-29 14:02:08.000000000 (SE Asia Standard Time)
Inode Modified: 2017-10-14 10:50:37.980707478 (SE Asia Standard Time)
File Created:   2017-10-14 10:40:13.824694342 (SE Asia Standard Time)

The same analysis was carried out on both the Windows7
2.vmdk and Backbox 2.vmdk files using Autopsy, by adding each
as a data source in the case that had been created. After the
analysis, a folder containing the files identified as the
digital evidence being sought was found in the "data rahasia"
directory. Image 1.jpg was successfully obtained, and after
verification its metadata and MD5 hash value
5baeafb27f3a4829dc0d4107c33e0cec matched the original file.
The final result and data verification of the Backbox 2.vmdk
analysis process can be seen in Table 3 below:


Table 3. Verify MD5 Hash Value

File Name        Original Hash                     Hash Result                       Verify
---------------  --------------------------------  --------------------------------  ------
Backbox 2.vmdk   3eb4d201f6c40e6df74cf4b3b125b8af  3eb4d201f6c40e6df74cf4b3b125b8af  Match
Windows7 2.vmdk  80a691fbc5352c0786b8a7e92ed3a7c3  80a691fbc5352c0786b8a7e92ed3a7c3  Match
Document 1.docx  95cec3cdd85a2fb971ffbba20883d709  95cec3cdd85a2fb971ffbba20883d709  Match
Document 2.docx  4692b14e433bf8251432e6b62ebfba2f  4692b14e433bf8251432e6b62ebfba2f  Match
Image 1.jpg      5baeafb27f3a4829dc0d4107c33e0cec  5baeafb27f3a4829dc0d4107c33e0cec  Match
Image 2.jpg      d1975580223131faca26e6eb4092cafd  d1975580223131faca26e6eb4092cafd  Match
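
The verification step behind Table 3 can be sketched as computing the digests of a recovered file in chunks and comparing them with the recorded original value. This is an illustration only: the function names are hypothetical, and the demo hashes a throwaway three-byte file rather than any file from this study.

```python
import hashlib
import os
import tempfile


def file_digests(path, chunk_size=1024 * 1024):
    """Return (md5, sha1) hex digests, reading the file in chunks so that
    multi-gigabyte images do not have to fit in memory."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def verify_md5(path, original_md5):
    """Compare a file's MD5 with the recorded original value."""
    result_md5, _ = file_digests(path)
    verdict = "Match" if result_md5 == original_md5.lower() else "Mismatch"
    return result_md5, verdict


# Demo on a temporary file containing b"abc" (placeholder, not evidence).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name
try:
    demo_md5, demo_sha1 = file_digests(demo_path)
    demo_result = verify_md5(demo_path, "900150983CD24FB0D6963F7D28E17F72")
    print(demo_md5, demo_sha1, demo_result[1])
finally:
    os.remove(demo_path)
```

In practice the path would point to a recovered file such as an extracted .vmdk, and the expected value would be the hash recorded before deletion.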



After recovery and analysis, the MD5 hash values of all
six files matched the originals, as shown in Table 3 above.
The Chrome browser history in Windows7 2.vmdk was also
analyzed with a history viewer. The analysis found evidence
of hacking and data manipulation on a high school website, as
shown in Figure 6.


[Figure: history viewer listing with URL and Title columns, showing Google searches, exploit-db.com pages, and a Havij 1.17 Pro download page on ethicalhackingtutorials.com]


Figure 6. Exploit-db search on chrome history 
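
A history viewer is not the only way to inspect this data: Chrome stores its history in an SQLite database whose `urls` table holds the URL, title, and last visit time (microseconds since 1601-01-01 UTC). The sketch below is an illustration, not the procedure used in this study. It queries a simplified in-memory stand-in for that table, and the visit times in the sample rows are placeholders rather than values from the evidence.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def webkit_to_datetime(microseconds):
    """Convert a Chrome/WebKit timestamp to an aware UTC datetime."""
    return WEBKIT_EPOCH + timedelta(microseconds=microseconds)


def datetime_to_webkit(moment):
    """Inverse conversion, used here only to build the sample rows."""
    return (moment - WEBKIT_EPOCH) // timedelta(microseconds=1)


def search_history(conn, keyword):
    """Return (url, title, last_visit) rows whose URL or title contains keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT url, title, last_visit_time FROM urls "
        "WHERE url LIKE ? OR title LIKE ? ORDER BY last_visit_time",
        (pattern, pattern),
    )
    return [(url, title, webkit_to_datetime(ts)) for url, title, ts in rows]


# Simplified stand-in for the History database (real files have more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, title TEXT, last_visit_time INTEGER)")
placeholder_time = datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc)  # not from the evidence
conn.executemany(
    "INSERT INTO urls VALUES (?, ?, ?)",
    [
        ("http://www.exploit-db.com/exploits/43101/", "Exploit page",
         datetime_to_webkit(placeholder_time)),
        ("https://example.com/", "Unrelated page",
         datetime_to_webkit(placeholder_time)),
    ],
)

hits = search_history(conn, "exploit-db")
for url, title, visited in hits:
    print(visited.isoformat(), url, title)
```

When working on real evidence, the query should be run against a copy of the extracted History file so that the original remains unmodified.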


The browser history analysis with the history viewer
revealed traces of searches on a site commonly used to look
for website vulnerabilities and on a site from which hackers
download hacking tools. From the history above, it appears
that the hacker opened the page
http://www.exploit-db.com/exploits/43101/, which contained
tutorials and tools for performing an SQL Injection attack on
a target website. There were also traces of a download of
Havij 1.17 Pro, a tool familiar among hackers, from the site
http://www.ethicalhackingtutorials.com/. This tool was most
likely used to find the username and password of the target
website.



Figure 7. URL History Chrome Browser 

Figure 7 above shows that on 31/10/2017 the user
tried to log in to the administrator page of the website
http://www.smanlmasbagik.sch.id and manipulated data in the
article with id = 25. A closer look at the history shows that
the hacker tried to log in as administrator several times and
also opened several menus inside the admin page. Besides
hacking, the attacker also downloaded a hacking tutorial file
from backtrack-linux.org, which was used as a guide for the
hacking.



Figure 8. Articles id = 25 before and after hacking 


A check of the site
http://www.smanlmasbagik.sch.id showed a change in the title
of the article with id = 25, as can be seen in Figure 8. When
the article was accessed in August 2017, before the hack, its
title was "5 Tallest Building in the World". After the hack
on 31/10/2017, the attacker had replaced the title with "The
5th Highest Building in a World That Can Divert the World".
The browser history and the difference in article titles on
the website show that the computer owner used the computer to
perform illegal hacking activities with Google hacking
techniques.


b) Registry Analysis 

The registry analysis used Regshot to capture two
conditions: the first when VirtualBox was installed, and the
second when the virtual machines were destroyed from
VirtualBox. The two registry snapshots were then compared to
find out whether the VirtualBox application had added
anything to the registry on Windows.

1. The first activity added 4 values, i.e. it added the new
virtual machines Backbox.vmdk, in the registry key
"HKU\S-1-5-21-1274466777-218395936-3752514298-1001\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs\.hiv\2:",
and Windows7.vmdk, in
"HKU\S-1-5-21-1274466777-218395936-3752514298-1001\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs\9:".

2. Two of the values come from the virtual machine cloning
process done in VirtualBox: cloning Backbox 2.vmdk,
"HKU\S-1-5-21-1274466777-218395936-3752514298-1001\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\OpenSavePidlMRU\hiv\2:",
and Windows7 2.vmdk,
"HKU\S-1-5-21-1274466777-218395936-3752514298-1001\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\OpenSavePidlMRU\*\9:".

3. The deleted value is the registry entry created while
destroying both virtual machines,
"HKU\S-1-5-21-1274466777-218395936-3752514298-1001\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SessionInfo\1\ApplicationViewManagement\W32:00000000000502E8\VirtualDesktop:".

From the results of this analysis, we found registry
entries recording the creation and deletion of files, but
they could not be used to recover data that had been
permanently deleted, such as Backbox.vmdk and windows 7.vmdk.

E. Report 

The purpose of this research is to find out whether
digital evidence can be recovered after it has been deleted
by the user in VirtualBox. In some cybercrime cases,
criminals use anti-forensic techniques to remove evidence
held in virtual machines from VirtualBox. This research
attempts to recover and analyze evidence removed with such
techniques.

The testing step was carried out to collect the required
data related to VirtualBox and the virtual machines on the
experimental system, in an effort to learn and document the
file and application structure. This stage also included
creating 4 virtual machines with different operating systems:
2 to be destroyed and 2 to be removed from the library. Data
was collected at every step of testing by monitoring the
application (registry, logs, etc.) to see whether any changes
occurred, after which the hard drive was acquired for
analysis and recovery.

The recovery of the destroyed virtual machines failed
because of the structure and characteristics of the virtual
machine itself, as well as the deletion method used by
VirtualBox, which deletes files bit by bit. This contrasts
with ordinary deletion in the Windows operating system, which
moves the file to the Recycle Bin. Removal by VirtualBox is
almost the same as erasing or wiping the data. To see why, it
helps to first understand the structure of the hard drive.

Put simply, the hard drive consists of 2 main parts: the
File Table and the Data Center. The File Table, known as the
MFT (Master File Table), is a structure on the hard disk that
maps the locations where files reside, much like a map or the
table of contents of a book. The Data Center is the actual
location where the data is stored. When a file is deleted in
the normal way on Windows, only its record, or
table-of-contents entry, is removed from the File Table along
with other details (size, location, etc.), while the original
data remains in the Data Center until it is overwritten by
other files. This is why a destroyed (erased/wiped) file
cannot be restored, whereas the files tested with Remove From
Library in VirtualBox and Delete on Windows can still be
recovered: only their table-of-contents entries need to be
rebuilt.
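
The difference between an ordinary delete and a wipe can be made concrete with a toy model. The class below is purely illustrative: it is not how NTFS or VirtualBox is implemented, and every name in it is hypothetical. It only mirrors the two-part picture above, a table of entries and a separate data area.

```python
class ToyDisk:
    """Toy storage with a file table and a separate data area."""

    def __init__(self, size):
        self.data = bytearray(size)  # the "Data Center"
        self.table = {}              # the "File Table": name -> (offset, length)
        self.next_free = 0

    def write(self, name, content):
        offset = self.next_free
        self.data[offset:offset + len(content)] = content
        self.table[name] = (offset, len(content))
        self.next_free += len(content)

    def delete(self, name):
        """Ordinary delete: drop the table entry, leave the bytes alone."""
        return self.table.pop(name)

    def wipe(self, name):
        """Wipe: drop the table entry and overwrite the bytes with zeros."""
        offset, length = self.table.pop(name)
        self.data[offset:offset + length] = bytes(length)
        return offset, length

    def raw_read(self, offset, length):
        """Read the data area directly, as a recovery tool would."""
        return bytes(self.data[offset:offset + length])


disk = ToyDisk(64)
disk.write("deleted.txt", b"secret-1")
disk.write("wiped.txt", b"secret-2")

deleted_at = disk.delete("deleted.txt")
wiped_at = disk.wipe("wiped.txt")

print("after delete:", disk.raw_read(*deleted_at))
print("after wipe:  ", disk.raw_read(*wiped_at))
```

After the ordinary delete, the content is still readable from the data area even though no table entry points to it; after the wipe, only zeros remain, so there is nothing left for a recovery tool to find.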

Overall, the acquisition, recovery, and analysis of
virtual machines that had been removed from the VirtualBox
library were considered a success. Although the destroyed
virtual machines could not be restored, some files leading to
the deleted evidence could still be recovered and analyzed.
The failure to recover the destroyed VMs indicates that
destroying a virtual machine is an effective way to eliminate
digital evidence. The final result and data verification of
the analysis and recovery process can be seen in Table 4
below:


Table 4. Result Analysis And Recovery

Name             Condition  Recoverable  MD5 Hash                          Result
---------------  ---------  -----------  --------------------------------  -------------
Backbox.vmdk     Destroy    No           Unknown                           Unrecoverable
Windows 7.vmdk   Destroy    No           Unknown                           Unrecoverable
Backbox 2.vmdk   Remove     Yes          3eb4d201f6c40e6df74cf4b3b125b8af  Match
Windows7 2.vmdk  Remove     Yes          80a691fbc5352c0786b8a7e92ed3a7c3  Match
Document 1.docx  Delete     Yes          95cec3cdd85a2fb971ffbba20883d709  Match
Document 2.docx  Delete     Yes          4692b14e433bf8251432e6b62ebfba2f  Match
Image 1.jpg      Delete     Yes          5baeafb27f3a4829dc0d4107c33e0cec  Match
Image 2.jpg      Delete     Yes          d1975580223131faca26e6eb4092cafd  Match


VI. CONCLUSION 

Based on the results obtained in the discussion of
recovery and analysis of deleted virtual machines using the
Virtual Machine Forensic Analysis & Recovery method, it can
be concluded that a virtual machine removed from the
VirtualBox library can be recovered and analyzed using
Autopsy and FTK with the analytical method proposed above.
Four deleted files in the VMDK file could also be recovered
and analyzed as digital evidence, with hashes and metadata
that matched the originals. A browser history analysis of
windows7 2.vmdk was also carried out, showing evidence that
the attacker had also committed a hacking crime against a
website.

Virtual machine images with Windows-based and Linux-based
operating systems that were deleted by the Destroy method in
VirtualBox cannot be recovered using Autopsy and FTK.
VirtualBox log analysis, deleted filesystem analysis, and
registry analysis all failed to recover backbox.vmdk and
windows 7.vmdk, because the deletion used a high-level
removal method that closely resembles wiping data from a hard
drive. Removing a virtual machine with this destructive
method is very effective for anti-forensics because it makes
it difficult for investigators to recover and analyze
evidence.

VII. FUTURE WORK 

This research is far from perfect. Some suggestions for 
further research are: 

1. Further research should analyze the Registry and Memory 
in depth to find out how the data deletion process occurred. 

2. This research focused only on the Destroy and Remove 
From Library methods and did not discuss snapshots; further 
research can add a discussion of virtual machine snapshots. 

3. Subsequent research is expected to apply the simulated 
crimes to a more realistic case. 

4. This research used FTK and Autopsy for acquisition and 
analysis; subsequent research is expected to use EnCase 
Forensic to find out whether a destroyed VM image can be 
recovered. 

Abstract — To preserve heritage and important sites, a survey was carried out to identify different tools 
and techniques for 3D modeling. It was observed that good-quality work related to 3D modeling is being done all over 
the world. Different researchers adopted different tools and techniques, along with different data acquisition 
methods. The study found that different tools are suited to different types of 3D model generation. 

Keywords — 3D modeling, heritage, important sites, digitization, LiDAR, passive, active, photogrammetry and 
point cloud. 

1. Introduction 

3D technologies are becoming popular in the management, digitization and conservation not only of historical 
sites but also of monuments of national importance. 3D visualization helps in planning houses, towns, cities, 
buildings and much more. A 3D representation presents an object far more effectively than 2D photos. 

As time passes, heritage and important sites, whether natural, cultural or mixed, are degrading due to 
ecological effects and artificial changes such as wars, natural disasters, climate change and human 
negligence, and one day these places will vanish [16]. This is leading to an increasing demand all over the world to 
preserve these important sites. Digitization (3D modeling) provides a way of preserving them. 3D 
modeling and texture mapping use point clouds (unstructured) and polygonal models (structured) [18] and provide 
a virtual view of sites. 

UNESCO has listed more than 1073 world heritage sites, and Italy has the highest count of heritage sites 
internationally. Nepal is also among the listed countries and is extremely sensitive to natural disasters [16]. 
Other countries such as Turkey [2, 38], Greece [28], Italy [31, 36] and India [48] are 
also moving towards digitization of heritage sites in order to protect them from degradation. India has digitized 
many historical sites, buildings and statues, not only for preservation but also to provide a realistic virtual 
view and to attract tourists. 

Researchers have adopted several approaches to developing 3D models, in which keeping track of 
changes in heritage/important sites was one of the major concerns. Creating a 3D model of an archaeological or 
cultural heritage site requires powerful tools and techniques. Many techniques are available that can 
capture a site and build a digital model with geo-referenced details [1]. Many sectors benefit from 3D 
modeling, including education, preservation of ancient places, civil engineering, interior design, marketing and 
town planning. 

The basic objective of this paper is to showcase the advances achieved in the field of 3D modeling and 
to present heritage sites and important monuments that have been digitized through 3D modeling. 

2. Overview of 3D Modeling 

Throughout the last decades, a number of researchers have addressed the use of photogrammetry to 
develop 3D models for creating virtual reality views of cultural heritage sites [7], important places and sculptures. 
Broadly, the methods used for this purpose fall into two groups: 

I. Image-based methods, also known as passive techniques 
II. Range-based methods, also known as active techniques 

These techniques can be combined to determine the 3D geometry and texture of the model [3]. 

Notable 3D projects are listed in Table 2, Table 3 and Table 4. 
Researchers have developed 3D models of cultural heritage sites in numerous ways. They are interested in recording the 
sites, and keeping track of changes is one of the objectives of many researchers [8]. 3D modeling serves as a complement 
to archaeologists' documentation [9]. 

Documents such as written records, sketches, drawings, paintings, diagrams and images were traditionally used to 
visualize the data and present the physical state of an archaeological site. Nowadays a number of software packages 
are used by researchers to visualize the data in 3D form [10, 11]. 3D modeling presents a virtual reality environment for 
visualizing not only heritage but also buildings, places, campuses and so on. The three fundamental steps of 3D modeling 
are [12-15]: 


a. Data Acquisition: The step of collecting data to create a 3D model is known as data acquisition. 
It involves a range of equipment and can be categorized on the basis of image-based and 
range-based data. 

b. Data Registration: Creating a 3D model requires images or point clouds that completely 
cover the object's surface in order to provide a realistic view. Point clouds from multiple images or scans need to be 
registered in the same coordinate system. Total Station and GPS systems are used for this registration, and pairwise 
or multi-scan registration is performed with the ICP algorithm. 

c. 3D Model Generation: The 3D model is generated from the sets of registered images or point 
clouds that represent the state of the object at the same point. The model is generated using 3D software such as 
Trivim, Maya, AutoCAD, VripPack, Autodesk 3ds Max and so on. 

3. Data Acquisition Sensors 

The 3D digital acquisition of the object and the structure is mostly conducted by means of 

1. Active Sensors 

2. Passive Sensors. 

Their integration depends on the required accuracy, object dimensions, location of sites, tools, usability, surface 
characteristics, team experience, project budget and the goal of the final survey [23]. 3D modeling requires computer 
graphics software such as SketchUp, 3D Studio Max, Blender, CityEngine, PhotoModeler, Photo4, 
Imagemodeler4, iWitness (accident reconstruction), Trivim and so on, which are listed in Table 2. 

Photogrammetry can be split into far-range photogrammetry and close-range photogrammetry (multi-station 
convergent photogrammetry) [24]. The classification is based on the distance between object and camera [25]. Far-range 
photogrammetry is space-based (the satellite camera axis is vertical or slightly off nadir) or aerial (the camera axis 
is vertical), and the distance between object and camera is quite large [26]. Close-range photogrammetry 
is terrestrial photogrammetry, where the distance between object and camera is small [27]. The camera axis of a 
terrestrial sensor is horizontal, and the sensor is usually placed on the ground [28]. 

3.1. Active Sensors (Range Sensor) 

Active sensors are pulsed (time-of-flight), phase-shift and triangulation instruments [29]. They record 3D geometry 
information as point clouds within the field of view (FOV), with geo-referenced data [30]. Terrestrial range sensors 
(SAR and close range) are used at very short range, and multispectral laser scanning at pre-defined wavelengths allows 
identification of the object material, humidity and moisture of the targets [31]. Long-range terrestrial laser scanners 
(time-of-flight, phase-shift) differ in FOV, sensor weight, wavelength range, angular accuracy and camera resolution, 
for example 0.2-3000 cm, 0.6 mm-150 m range accuracy; wavelengths of 532 nm, 660 nm, 670 nm, 690 nm, 785 nm, 905 nm, 
1064 nm, 1535 nm, NIR and VIS; and cameras from 1 megapixel to 70 megapixels. Table 1 details the 
information. 
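
As a rough illustration of the pulsed and phase-shift principles named above, the following sketch converts a measured round-trip time or phase difference into a range. The function names and sample values are hypothetical, and real scanners apply further corrections (atmosphere, calibration, phase ambiguity) that are ignored here.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second, in vacuum


def tof_range(round_trip_seconds):
    """Pulsed scanner: the pulse travels to the target and back."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


def phase_shift_range(phase_radians, modulation_hz):
    """Phase-shift scanner: range from the phase lag of a modulated beam.

    Valid only inside the unambiguous interval c / (2 * modulation_hz).
    """
    return SPEED_OF_LIGHT * phase_radians / (4.0 * math.pi * modulation_hz)


if __name__ == "__main__":
    # A 1 microsecond round trip corresponds to roughly 150 m.
    print(round(tof_range(1e-6), 3))
    # Half a cycle of phase lag at 10 MHz modulation.
    print(round(phase_shift_range(math.pi, 10e6), 3))
```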

Airborne Laser Scanner (ALS) sensors are mounted on airborne platforms (helicopters or fixed-wing aircraft) [32] and 
are also called LiDAR (Light Detection and Ranging) sensors [33]. ALS is combined with Global Navigation Satellite 
System/Navigation Satellite (GNSS/NS) sensors to measure accurate position and is used to generate Digital Surface 
Models (DSM) and Digital Terrain Models (DTM), and for city modeling, forest applications, corridor mapping, structural 
modeling, change detection, heritage documentation, archaeological applications, landslide and glacier monitoring, 
man-made structures, etc. Optical RS images have limited availability, specifically aerial images [34]. Satellite 
imaging is affected by the sensor viewing angle (across-track and along-track), sun acquisition angles, atmospheric 
conditions and saturation [35]. A disadvantage of range-based modeling is the high cost of software and hardware [36]. 

3.2. Passive Sensors 

A passive sensor is a digital camera that records images, whereas an active sensor is a laser scanner or radar [20]. 
Such sensors provide 3D information. Terrestrial sensors (active and passive) are used to find 3D shapes with 3D 
imaging techniques [21]. A Synthetic Aperture Radar (SAR) sensor is not considered an optical sensor. 3D surveying and 
modeling are based on digital recording, passive sensors, range data, active sensors, classical surveying, 2D maps and 
integration methods [22]. 

Obtaining 3D information requires at least two 2D images [37]. Photogrammetry and computer 
vision are the best techniques for this. They are used in cases of lost object data, for architecture, small object 
shapes and point analysis, and for low-budget terrestrial projects. Image-based modeling has low software and hardware 
costs [41]. Terrestrial sensing uses digital cameras (such as DSLRs) of about 10-60 megapixels. Mobile phones are also 
used for photogrammetry [46]. UAV (Unmanned Aerial Vehicle) platforms, such as model helicopters, are used for 
low-altitude aerial acquisition. Airborne digital cameras come in small, medium and large formats 
with different spectral bands (RGB, PAN and NIR) and geometric resolutions [48]. Very high resolution satellite images 
(<15m) are also used. In ALS, several components contribute to the final accuracy of the range data. Terrestrial 
scanning can run into several problems related to size, shape, location, rough or sloped surfaces (geometry and 
material), unfavorable weather conditions, lighting and missing parts [50]. ALS is based on direct geo-referencing. 
Aerial and terrestrial scanning use a sampled distance. 
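
The need for at least two images can be made concrete with the standard relation for a rectified stereo pair, Z = f * B / d, where f is the focal length in pixels, B the baseline between the two camera positions and d the disparity of a point between the two images. The sketch below is only an illustration under that idealized assumption; the function name and the numbers are hypothetical and are not taken from the surveyed works.

```python
def stereo_depth(focal_length_px, baseline_m, disparity_px):
    """Depth of a point seen in both images of a rectified stereo pair.

    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two camera centres, in metres
    disparity_px:    horizontal shift of the point between the images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Larger disparity means the point is closer to the cameras.
    for d in (10.0, 25.0, 50.0):
        print(d, stereo_depth(1000.0, 0.5, d))
```

With a single image the disparity is undefined, which is why depth cannot be recovered without a second view or other prior knowledge.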

Data fusion is the standard framework for 3D modeling, for example [51]: 

1. Optical satellite data combined with radar data 

2. Panchromatic data with multispectral data 

3. Aerial data with LiDAR data 

4. Terrestrial scanning data with photogrammetric data. 

Range imaging cameras and mobile mapping also use sensor fusion [52]. Mobile mapping is a cost-effective 
way of acquiring geo-referenced spatial data with long-range laser scanners and Global Navigation Satellite 
System/Inertial Measurement Unit (GNSS/IMU) positioning sensors. It has been developed for academic research in areas 
such as topographic surveying, 3D mapping of traffic arteries, city planning, visual street vector data and 
visualization [53]. 

4. Tools and Techniques Used for 3D Model Generation 

Generation of a 3D model can be completed with the help of different tools and techniques. Different researchers 
have adopted different software and techniques for creating models, as detailed in Table 
2. Some of the most frequently used tools and techniques are described below. 

4.1. CAD 

Computer-Aided Design (CAD) technology is useful for creating designs and documentation. It allows the user to make 
drawings of 2D and 3D models. This technique replaces manual 
drafting with an automated process. CAD helps in making drawings that are proportional to the geometry of the model. 
It helps to explore ideas, to produce photorealistic renderings of designs, and to see how a design will look in the 
real world or vice versa. CAD programs run as software, and AutoCAD is one of the earliest and most widely used 
applications [19]. 

CAD includes Autodesk products used to create factory designs, product designs and models of cars, buildings, ships, 
motorbikes etc., and provides the facility of rotating the model. It offers high accuracy for these models. However, 
this software is expensive (open-source alternatives can be tried) and needs to be updated after every release of 
a new package. 

4.2. Sketchup Pro 

Sketchup Pro is used to create sketches and 3D models of furniture, interiors, landscapes, buildings and more, 
and acts as an interface that reflects the user's way of working. It is commercial software that provides animations, 
scenes and printouts with realistic light and shadows. Models can be printed on a 3D printer. It imports CAD files 
and exports CAD and PDF files. Its interface is designed to be simple and easy to use. Sketchup Pro 
produces scaled, accurate drawings, and the models created with it are highly detailed, geo-located and 
accurate, which makes them useful for presentations. It also supports hand-drawn rendering styles. 

Sketchup Pro provides tools to convert drawings to 2D presentations. It is most beneficial for 
architectural and interior design. It is easy to draw lines, rectangles and circles and to model them by pushing and 
pulling. For rendering, extra extensions must be added from the Extension Warehouse, or separate software must be 
downloaded to render the model. 



(a) 


(b) 


Figure 1: Sketchup Pro Output, (a) 2D image (b) 3D model of same 2D image [59]. 

4.3. PhotoModeler 

This software creates 3D models from images taken with an ordinary camera. It supports 
measurement of 3D models and is a cost-effective way to obtain accurate 3D scanning, surveying, 
measurements, realistic data capture and a realistic view of the model. It is used mostly for accident reconstruction 
and for trade shows. It applies texture to the model as it appears in the original photograph, and the 3D models 
can be exported with this texture. Besides exporting data, it can import 3D data for matching, 
compression and control points in the solution; the accepted file formats are 3D Studio, 
Wavefront OBJ, DXF, 3DS and raw text files. The software acts as a black box: it does not provide detailed information 
on the accuracy of the calculated internal and external orientation parameters of the model, or of the resulting 
output model. It does not support demonstrating models with high accuracy. 



Figure 2: 3D output of PhotoModeler software [61]. 

4.4. Agisoft PhotoScan 

Agisoft PhotoScan performs photogrammetric processing of digital images to create 3D spatial data 
useful in GIS applications, heritage documentation and visual effects. It is also beneficial for indirect measurement 
of objects at different scales. It can produce high-quality, accurate results in the form of a 3D model. Processing of 
the data is automatic and smart. The software processes both terrestrial and aerial images. It uses Ground 
Control Points (GCP) during processing and applies the SfM-MVS algorithm. 

4.5. MeshLab 

MeshLab is open-source software for processing and editing 3D triangular meshes. It provides a set of tools 
for editing, rendering, healing, cleaning, inspecting, texturing and converting meshes. Raw data is processed with 
these features to produce a 3D digital model, which can be printed. MeshLab can handle large files. Its main 
drawback is that its options are not user friendly. 

4.6. 3D Mesh Model 

This technique builds the structure of a 3D model out of polygons. The model is represented along three axes, which 
define the shape of the object by height, width and depth. The product of photogrammetric 3D modeling is called a mesh 
model and is triangulated from the original point cloud. It is widely used for visualization because it can be textured 
to give a photorealistic view. It is useful for conserving heritage sites and is also used for online virtual tourism 
to attract tourists [16]. 

4.7. Structure from Motion (SfM) 

SfM is a photogrammetric technique for creating 3D models and ortho-images from a series of 
overlapping photographs. It is a low-cost photogrammetric method useful for reconstructing high-resolution 
topography. SfM estimates a 3D model (structure) from 2D images. It can produce 3D 
models from a series of photographs, even those taken with smartphone cameras, UAVs or other photogrammetric 
equipment [18]. In this technique a set of photographs is analyzed for texture to find key points shared among 
the images. These points are linked across the photoset, producing a sparse point cloud from 
which the camera positions can be calculated. Working with the internal camera parameters is complicated. 

4.8. Digital Elevation Model (DEM) 

A DEM is a digital representation of the terrain surface (also known as topographical relief, which involves the 
vertical and horizontal dimensions of the land surface). It can also be described as a gridded array of elevations, or 
as the elevation of the earth's surface above a certain level (datum). A DEM can be derived from different sources such 
as LiDAR [11], IFSAR, photogrammetry and other less commonly used sources. The quality of a DEM depends on its 
horizontal and vertical accuracy, measured as the average discrepancy between surveyed positions and sample points of 
the grid. DEMs are useful in many applications such as 3D modeling, Google Earth, navigation and engineering. Elevation 
can be measured at regularly spaced points or at irregularly spaced points on the earth's surface. A DEM can be 
described with various acronyms, the two most common being DSM and DTM. DEMs adapt only to a limited extent to 
topographic structures. 
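
The accuracy measure described above can be sketched as follows. The text speaks of an average discrepancy; the sketch reports both the mean absolute error and the root-mean-square error (RMSE), the latter being a commonly used alternative that is an assumption here rather than something stated by the cited works. Function and variable names are hypothetical.

```python
import math


def vertical_accuracy(surveyed, sampled):
    """Compare surveyed elevations with DEM elevations at the same locations.

    Returns (mean absolute error, root-mean-square error), in the units
    of the input elevations.
    """
    if len(surveyed) != len(sampled) or not surveyed:
        raise ValueError("need two non-empty lists of equal length")
    errors = [dem - ref for dem, ref in zip(sampled, surveyed)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


if __name__ == "__main__":
    check_points = [10.0, 12.0, 14.0]   # surveyed elevations (made-up values)
    dem_values = [11.0, 12.0, 12.0]     # DEM elevations at the same places
    print(vertical_accuracy(check_points, dem_values))
```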

4.9. Point Cloud (PC) 

A point cloud (PC) is a set of data points in a 3D coordinate system and is used as a 3D modeling technique. Point 
clouds can be created with 3D scanners, which measure a large number of points on the object's surface. Working with 
point clouds involves detecting and picking homologous points in different scans. The points can be processed after 
registration, which can be done with available software such as Z+F LaserControl [24]. Accurate point clouds can be 
obtained by taking into account the distance accuracy and spatial resolution of the laser scanner. Processing also 
involves checking the data points and removing bad points. All point clouds are registered in a single coordinate 
system for a complex visualization model, and a filtering process then produces a mesh (polygon mesh or triangle mesh) 
[21]. A point cloud may contain false-positive points. Template matching is also only approximately correct, as it 
contains many views near its viewpoint. 
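
The removal of bad points can be illustrated with a simple statistical filter: a point is dropped when its mean distance to its nearest neighbours is unusually large compared with the rest of the cloud. This is one common approach, assumed here for illustration; the text does not say which filter the cited works used. The function name, the parameters `k` and `std_ratio`, and the brute-force neighbour search are all choices made for brevity.

```python
import math
import statistics


def remove_outliers(points, k=3, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours
    exceeds the cloud-wide mean by more than std_ratio standard deviations.

    points: list of (x, y, z) tuples; needs more than k points.
    Brute force, O(n^2): suitable only for small illustrative clouds.
    """
    if len(points) <= k:
        raise ValueError("need more points than neighbours")
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    threshold = statistics.mean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= threshold]


if __name__ == "__main__":
    # A flat 3 x 3 patch of points plus one stray point far away.
    cloud = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
    cloud.append((50.0, 50.0, 50.0))
    print(len(cloud), "->", len(remove_outliers(cloud)))
```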




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 2, February 2018 


4.10. Terrestrial Laser Scanning (TLS) 

TLS is becoming more popular nowadays because of its features. Terrestrial laser scanners are devices used 
for contact-free measurement that collect dense 3D point clouds from objects. This 
technique is important for surveying applications, as the instrument captures the 3D coordinates 
of large numbers of ground points with high speed and accuracy. TLS is used to capture highly detailed architectural 
geometric data and can easily collect data from unreachable places within minutes. It has also become popular in mobile 
robot navigation for constructing metric-scale 3D models [37]. Creating the surface model requires a long processing 
time, and the accuracy varies. The accuracy depends on the camera calibration points and is easily affected by coarse 
image acquisition geometry. 

4.11. Digital Surface Model (DSM) 

A DSM represents the first surface of the earth as seen by the remote sensor and records the 
height of that first surface above the ground. The view it provides therefore includes vegetation, buildings, houses, 
factories and every other feature that rises above the bare earth. 



Figure 3: Digital Surface Model of a motorway interchange construction site [62]. 

It is useful in 3D modeling, urban/town/city planning, telecommunication, determining obstacles on runways, 
management of vegetation etc. It is generally obtained from LiDAR data [43] and contains details of the entire land 
surface. A LiDAR DSM reveals details of extensive geomorphology. LiDAR DSMs are recorded by mapping aerial 
photographs. High resolution requires a large number of points. 

4.12. Digital Terrain Model (DTM) 

A DTM is an elevation surface that represents the bare earth relative to a vertical datum. It is vector data with 
regularly spaced points. It provides a topographic model of the bare earth's surface, that is, the terrain underlying 
the surface features. It is usually derived from a DSM by digitally removing the cultural, man-made and vegetation 
features. DTMs are used in a range of GIS and CAD formats. The quality of this technique is measured by the accuracy 
of the elevation at each pixel and the accuracy of the presented morphology. The impact of migration is not included 
[43]. 
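
The relationship between the two models can be sketched numerically: if a DTM is a DSM with above-ground features removed, then subtracting the DTM from the DSM cell by cell gives the height of those features, a product often called a normalized DSM. The sketch assumes both grids are already co-registered with the same shape; the function name and grid values are hypothetical.

```python
def normalized_dsm(dsm, dtm):
    """Per-cell height of above-ground features: DSM minus DTM.

    dsm, dtm: lists of rows of elevations on the same grid.
    Small negative differences (noise) are clamped to zero.
    """
    if len(dsm) != len(dtm) or any(len(a) != len(b) for a, b in zip(dsm, dtm)):
        raise ValueError("DSM and DTM must have the same shape")
    return [
        [max(surface - terrain, 0.0) for surface, terrain in zip(row_s, row_t)]
        for row_s, row_t in zip(dsm, dtm)
    ]


if __name__ == "__main__":
    dsm = [[105.0, 110.0], [100.0, 102.0]]  # first-surface elevations
    dtm = [[100.0, 100.0], [100.0, 101.0]]  # bare-earth elevations
    print(normalized_dsm(dsm, dtm))
```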

4.13. Triangulated Irregular Network (TIN) 

A TIN model represents the surface as a set of non-overlapping, contiguous triangles. Within each triangle the surface 
is represented by a plane. The triangles are built from a set of points called mass points. A TIN can describe the 
surface at different levels of resolution and stores data very efficiently. It sometimes requires manual control of 
the network and visual inspection [43]. It has a very complicated data structure with limited possibilities for 
analysis. 
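
Because each triangle is treated as a plane, the elevation at any location inside a triangle follows from its three mass points. The sketch below does this with barycentric weights, a standard way of evaluating a plane through three points; the function name and coordinates are hypothetical.

```python
def interpolate_elevation(point, a, b, c):
    """Elevation at `point` on the plane through TIN vertices a, b and c.

    point:   (x, y) location
    a, b, c: (x, y, z) mass points forming one triangle
    Returns None when the location lies outside the triangle.
    """
    x, y = point
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle: vertices are collinear")
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < 0:
        return None
    return w1 * z1 + w2 * z2 + w3 * z3


if __name__ == "__main__":
    tri = ((0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0))
    print(interpolate_elevation((2.0, 2.0), *tri))
    print(interpolate_elevation((20.0, 20.0), *tri))
```

A full TIN would first locate the triangle that contains the query location; that search is left out here.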

4.14. 3D Surface Modeling (3DSM) 

3DSM is used to reconstruct 2D Scanning Electron Microscope (SEM) images as a 3D model. It is a mathematical 
representation of solid objects and can create associative and NURBS surfaces. This technique is used for animation 
in games, movies, presentations and other areas. It is difficult to construct and to calculate the mass properties of 
the object, and creation and manipulation take more time. 

4.15. Autodesk 3ds Max 

Autodesk 3ds Max is a powerful tool used by developers for making games and graphic designs and for providing visual 
effects, and it is one of the most popular tools. It is useful for 3D modeling, animation and rendering. It helps to 
create any type of model, and its basic goal is to handle highly complex work. 

The tool is written in C++ and is available in more than 40 languages. It provides a large toolset for 
3D modeling and creates life-like 3D models. 



Figure 4: Autodesk 3ds Max [60]. 

Autodesk 3ds Max also provides perspective matching, particle animation, vector map support, 2D pan and zoom and many 
more features. 

The main difficulty with this software is that it is hard to learn and understand, which makes it 
time consuming, but it works very well for complex structures. The rendering process is slow, but this can be 
overcome by using external add-on rendering tools. 

Table 1: Optical Remote Sensors and Cameras used for Data Acquisition 


| Sensor | Camera | Scale/Megapixels | Ref. |
|---|---|---|---|
| Opto Top-HE, opto Top-SE | CCD Leica Digilux1, Sony DSC-W60 | 4mpx, 6mpx | 1 |
| IKONOS | Canon PowerShot A530 | 1:5000 (cadastral-1:1000, topographic-1:5000) | 2 |
| Laser scanner | Camera | 4 and 5 mpx | 3 |
| Lidar | | 1:50000 | 4 |
| Cyrax 2500 laser scanner, Leica TCR 705 Total Station | Nicon D100 | | 5 |
| Optech 1205 ALTM Lidar and CASI | (40 km) | 288 spectral bands | 6 |
| CASI & Thermal Imaging | | 1:500, 1:1600 | 7 |
| ALTM 2033 Lidar | | 1047nm, NIR | 9 |
| Lidar | | 1:10000, 1:1000, 1:24000 | 10 |
| Optech ALTM | | | 11 |
| NCALM | | 2 m resolution, 1-4m Wavelength | 12 |
| | Iphone 3G, Nokia N900 phone, 6 DOF monocular Camera | 320*240 pixels | 14 |
| Leica HDS P20 | Canon EOS 600D, Huawei Nexus 6P | Sensor size- 22.3mm, 6.2mm | 16 |
| Total station | Sony DSLR A700 | 4272*2848 pixels | 17 |
| HMD | | 1080*1200 | 18 |
| | Canon IXUS 175 | 20 mpx | 19 |
| Laser scanner Faro Focus 3D | | | 20 |
| Laser scanner Z+F Imager 5006h | Nikon D500 SLR | 8mm Samyang lens | 21 |
| FOG, MEMS | | | 22 |
| Lidar | | | 23 |
| Laser scanner Z+F Imager 5010 | | | 24 |
| T2 Wild theodolite & TC 705 Total station | Close-range Photogrammetry | | 25 |
| IKONOS & Quickbird stereo image | | | 26 |
| IKONOS | | | 27 |
| ILRIS 36D Optech Laser scanner, Leica TCR305 Geodetic station (500mm) | Canon EOS 400D | 2MP, 10MP (3888*2592 pixels) | 28 |
| FARO Focus 3D Laser scanner | Camera | 70Mp | 29 |
| TOPCon 3005N (TS), Leica HDS 7000 LS | Canon 60D | 18Mp | 30 |
| UAV-MD4-1000, Leica HDS 7000, Z+F 5600h | Olympus E-P1, Olympus XZ-1, M-Cam camera, Nikon D3X | 12Mp, 10Mp, 5Mp | 31 |
| | Canon 550D, Nikon D200 | 200m lens | 32 |
| Konica Minolta VI910 Laser scanner | | | 33 |
| | Kodak DCS | 13.5Mp | 37 |
| IKONOS | | 1:5000 scale | 38 |
| | Kodak CX7300 | | 41 |
| TOP10vector | | 1:10000 scale | 43 |


5. Related Work 

A lot of work relates to 3D modeling. Table 2 shows that a great deal of work has been done internationally 
in 3D modeling, adopting different software and techniques as needed. Table 3 illustrates that a 
good amount of work has been completed nationally (India). The work performed regionally, in the Marathwada region, is 
presented in Table 4. 

6. Challenges 

The challenges faced in 3D modeling are large sites, complex objects, selecting the appropriate methodology (sensors, 
hardware and software), data processing, a proper workflow, and the correctness and accuracy of the final result of a 
given technique [54]. The term 3D modeling is used in terrestrial applications, whereas mapping is used in the aerial 
domain [55]. Open research issues are [56-61]: 

• New sensors and platforms frequently come onto the market. 

• Integration of sensors and data fusion (combination of sensors). 

• Automated processing: point cloud processing depends on many factors. 

• Each instrument has its own file format and specific software. 

• Online and real-time processing. 

• Feature extraction: information extraction needs to be more reliable, precise and effective. 

• Improvement of geo-spatial data and content, as the number of users and the demand for data are growing. 

• Development of new tools for non-expert users: 3D recording is an interdisciplinary task, and non-technical users 
often do not understand the software or packages. 

As per the literature review of 3D modeling, there are various newly developed sensors, methodologies, 
techniques, algorithms, hardware and software. Photogrammetry provides accurate 3D reconstructions at different scales 
and with combinations of 3D models. 

Table 2: Digitization of Heritage Sites performed at International Scale, with Software and Techniques 


| Software | Techniques | Important Site | Ref. |
|---|---|---|---|
| Geo-Magic Studio v.6, IMView, IMMerge, IMEdit, Blender, Arius 3D | LS3D method, VCLab's 3D scanning tools | Italy | 1 |
| CAD, Mapinfo | DEM | Safranbolu, Turkey | 2 |
| CAD, Poly works | Facade, CCD | Abbey of Pomposa near Ferrara & Scrovegni Chapel in Padova | 3 |
| IKAW | DEM | Netherlands | 4 |
| ArcGIS, CAD (AutoCAD), VripPack, IBM ViaVoice v.10 | VR (Video) | Monte Polizzo in Western Sicily | 5 |
| | NDVI | England | 6 |
| Automated S/W | DEM | Cherwell Valley in North Oxfordshire | 7 |
| | Kriging, Filtering | Netherlands | 8 |
| Golden, Surfer Surface | Kriging, High-Pass Filter, DTM, DSM | River Trent | 9 |
| ArcGIS, IDL package | DEM | Chesapeake Bay in state of Maryland | 10 |
| Trimble Navigations GPSurvay v.2.35a, GeeoLab v.3.61, WayPoint Grav NAV v.6.03, ArcGIS 9.1 | DEM and Bare-Earth Models | Isle Royale National Park, Michigan | 11 |
| Terrasolid's Terrascan, ArcGIS, Matlab | Kriging (DSM-1m, DEM-50cm), SVF | Central Yucatan, Mexico | 12 |
| GUI Programming | Plane estimation Algorithm, PTAM, EKF-SLAM | | 14 |
| zWarper, CAD\CAM, Maya, 3DStudio Max, Cinema 4D | Voxel based Techniques | | 15 |
| Bentleys ContextCapture, VisualSfM, AutoCAD, CityGML | Mesh Model | Kathmandu valley, Nepal | 16 |
| Orthophotomosaics | 3D surface model | Cape Glaros & Metohi, European | 17 |
| AgiSoft PhotoScan | SfM, MVS | Mazotos Shipwreck in Cyprus | 18 |
| PhotoModeller, AutoCAD, HDR, ArcGIS | RMSE | Bulgaria | 19 |
| Agisoft Photoscan, SURE, MicMac, PMVS, Zephir Aerial, Maya, 3D Studio, Rhinoceros | SfM | Italy | 20 |
| Cyclone 8.1, GeoMagic Studio | Point cloud | Statue of Leonardo da Vinci in Piazza della Scala, Milan | 21 |
| MMS | 3D modeling | Canada | 22 |
| WebGL API, OpenGL, EbGL API, JavaScript, OpenLayers3 | TIN model | Norway | 23 |
| Z+F LaserControl, MeshLAb, CloudCompare | Filtering, Point Cloud | Italy | 24 |
| | DTM | United Arab Emirates | 25 |
| Cinema 4D (C4D) | DEM, TIN | Tuekta, Germany | 27 |
| Iwitness, ShapeCapture, Parser, Innovmetric polywork II and Erdas Imagine 9.2 | Point cloud, ICP and Basesight position orientation algorithm | Roussanou Monastery in Meteora, & Dispilio, Greece | 28 |
| CHDK, Visual SfM, Agisoft Photoscan, Polyworks, AutoCAD | SfM algorithm | Byzantine walls of Aquileia, Italy | 30 |
| Bundler, Meshlab | SIFT & SURF algorithm, SVD method | Mazotos area, Cyprus | 32 |
| Arc3D | web-service | L'Aquila Museum, Italy | 33 |
| ArcGIS, Autodesk 3ds Max, AutoCAD | | Yildiz Technical University campus Information System, Istanbul, Turkey | 34 |
| Maya, 3DSMax | CAD modeling, ICP Algorithm | Angkor Wat temple, Cambodia | 36 |
| 3D Studio, Maya, SketchUP | TLS | Italy | 37 |
| CAD drawing | Terrestrial Photogrammetric | Safranbolu, Western Black Sea Region, Turkey | 38 |
| | MMI algorithm, DLT method | Italy | 39 |
| CyberCity, ScanLet, 3ds max, sketchUP, Maya | DTM, web-based visualization | Brandenburger Tor, Berlin, Germany | 42 |
| | DTM, DSM, TIN | Prins claus plein, Netherlands | 43 |
| Top VRML, Matlab 6.0 | DTM | Paterna, Valencia, Spain | 44 |
| | RSV, Segmentation, ICP algorithm | Pierre, Beauvais, France | 46 |
| CAD | Sketch algorithm | Sagalassos, Turkey | 47 |


Table 3: Digitization of Heritage Sites performed at National Scale, with Software and Techniques 

| Software | Techniques | Important Site | Ref. |
|---|---|---|---|
| CyberCityModeler, Sketchup Pro, CityGML, RFM | DEM, RANSAC & Epipolar frame Algorithms | India | 26 |
| FARO Scene | Rendering Virtual model | Dept of CS & IT, Dr. BAMU, Aurangabad, India | 29 |
| SketchUP, AutoCAD | Different file formats | IITB, Powai, Mumbai, India | 35 |
| Photomodeler 5 | RMS | IIT, Roorkee, Uttarakhand, Haridwar district, India | 41 |
| CAD, ArcGIS | DEM | Hampi, Karnataka, India | 48 |


Table 4: Digitization of Heritage Sites performed at Regional Scale, with Software and Techniques 

| Software | Techniques | Important Site | Ref. |
|---|---|---|---|
| FARO Scene | Rendering Virtual model | Dept of CS & IT, Dr. BAMU, Aurangabad, India | 29 |
| SketchUP, AutoCAD | Different file formats | IITB, Powai, Mumbai, India | 35 |


7. Conclusion 

3D modeling and digitization are important because they make it possible to preserve historical and other important
monuments. Advances in tools and techniques have driven technology transfer for digitization. The study above shows
that various tools and techniques are available for creating 3D models of different objects. All of these software
packages are suitable for 3D modeling, but each gives better results for a specific purpose. CAD works better for
factory designs and for models of cars and other products. SketchUp Pro is used for interior design and architecture.
PhotoModeler can be used for reconstructing accidents and trade shows, whereas Autodesk 3ds Max is used for complex
and large projects. Many other software packages are also available for 3D modeling, each with its own strengths and
shortcomings.
Abstract: The growth of the Internet of Things and wireless technology has led to the generation of enormous
volumes of data for applications such as healthcare, scientific and data intensive computing. Cloud based
Storage Area Networks (SAN) have been widely used in recent times for storing and processing these data.
Providing fault tolerant and continuous access to data with minimal latency and cost is challenging, so an
efficient fault tolerance mechanism is required. Data replication is an efficient way of providing fault
tolerance and has been considered by existing methodologies. However, data replica placement is challenging,
and existing methods are not efficient for the dynamic application requirements of cloud based storage area
networks. They therefore incur latency, which in turn induces a higher cost of data transmission. This work
presents an efficient replica placement and transmission technique, Bipartite Graph based Data Replica
Placement (BGDRP), that aids in minimizing latency and computing cost. The performance of BGDRP is evaluated
using real-time scientific application workflows. The outcome shows that the BGDRP technique reduces data
access latency, computation time and cost compared with the state-of-art technique.

Keywords: Cloud computing, Bipartite graph, Data replica placement, Fault tolerant, ILP, SAN, SDN.


I. Introduction

In recent years, Big Data applications (such as scientific, data intensive and Video on Demand (VoD)
services) have become the most prominent emerging applications on next generation computing platforms due to
the massive growth of data creation and storage in the real world. According to a 2012 study, the steady
growth of data has taken the size of a single dataset from a few terabytes to many petabytes [1]. Big Data
applications are characterized by huge volume, high velocity and highly diverse information, which require
various processing methods to enable optimization, insight discovery and precise decision making [2]. Massive
amounts of data are generated every day in many real-world areas such as telecommunication, medicine,
pharmaceuticals, internet surfing, business and information technology.

An efficient storage (data replica placement) and transmission mechanism is required, and it is considered a
critical component of such real-time computing applications. The storage platform can be either centralized
or distributed in nature. To achieve scalability, reliability, availability and durability, a distributed
architecture has been adopted by various researchers. Storage is prone to disk failures; as a result, data are
stored across servers to provide durability and avoid a single point of failure (fault tolerance). Scalability
minimizes the data access latency across servers/datacenters, and reliability ensures the correctness of the
data. Several storage technologies with different features have been presented in recent times, such as
Cassandra [3], Freenet [4] and Bigtable [5]. Therefore, when designing a storage architecture it is important
to identify the most significant features. Real-time applications such as bioinformatics, scientific and space
research services require low latency data access and transmission methods.

Scientific frameworks such as XrootD [6] and NetCDF [7] have been presented. These applications are
generally read only or append only. Hence, they issue a high volume of I/O (input/output) requests to the
storage architecture, which enables parallelization within the application and the storage architecture. To
provision a scalable, high performance and low latency storage architecture, different technologies such as
Network-attached Storage (NAS), Direct-attached Storage (DAS) and Storage Area Network (SAN) have been adopted.
The outcome obtained in [8] shows that SAN gives better performance than NAS. Provisioning efficient resource
allocation for users in a SAN involves numerous challenges such as data placement and data reconfiguration.
Minimizing data access cost and latency on such a platform is highly desirable. Cache optimization, cost
optimization and reconfiguration methods for data placement were presented in [9] and [10]. However, they are
not efficient for present dynamic computing applications, which require a fault-tolerant data placement and
transmission mechanism. To meet the fault tolerance requirement, the cloud computing framework has been
adopted.

Moreover, recent years have also seen phenomenal growth in the usage of cloud computing applications, owing
to the pay-as-you-go model and heavy promotion by service providers. A cloud computing application is a
distributed computing application that offers services on demand over the internet [11]. Cloud providers like
Amazon and Microsoft provide resources of any scale, arranged in the form of virtual machines (VMs), under the
Infrastructure-as-a-Service (IaaS) model of cloud computing [12]. The reason for the immense growth of cloud
computing applications is the large saving in computational time and storage capacity and the availability of
various resources. The time needed to perform a given task on a virtual machine clearly depends on the length
of the task (million instructions) and the computation power of the virtual machine (million instructions per
second per core). In cloud applications, functions can be executed with different levels of criticality, which
can increase their execution time. Therefore, to perform millions of tasks at a time, an efficient data
placement and transmission technique is required. Using such a technique, the execution time and cost of tasks
can be lowered.

To take advantage of both SAN and the cloud computing framework, several hybrid [12], [14], [15] and
heterogeneous [16] approaches have been presented. A future SAN model should consider heterogeneity of storage
when provisioning real-time services to users. In [17], virtual resource partitioning was adopted for cache
optimization under heterogeneous I/O workloads in a virtualized storage environment. However, the model is
neither efficient nor adaptive, since it does not consider the dynamic traffic pattern of users when solving
data placement problems. To address this, [18] presented a checkpoint based placement optimization algorithm
which utilizes both burst (traffic) and parallel filesystems. However, it incurs latency and request failures
[19], as data are stored across different locations. As a result, it induces high cost and computation
overhead [20]. To minimize the latency of data access, [21] considered data replication placement. Data
replication is a method of storing the same data across different nodes/datacenters to provide fault tolerance
with minimal data access latency. To solve the problem of data replication placement, they presented a genetic
algorithm based strategy. However, their model suffers from the integer linear programming (ILP) problem [22]
and as a result incurs high computation overhead. To overcome this research issue, graph partitioning and
optimization techniques are adopted in [16], [17], [18], [21], [25] and [26]. This work presents a Bipartite
Graph based Data Replica Placement (BGDRP) and data transmission technique for cloud based Storage Area
Networks to provision execution of real-time workflows. The BGDRP technique aims at minimizing data access and
transmission latency, computation time and computing cost.

The contributions of this research work are as follows:

• This work considers a Bipartite graph based model for data replica placement on a cloud based SAN.

• We consider a multi-objective function to find the optimal data replica placement and data transmission
solution.

• Experiments are conducted on real-time workflows, and performance is evaluated in terms of task
completion time, cost and latency.

• The outcome shows a significant performance improvement over the state-of-art architecture.

The rest of the paper is organized as follows. Section II presents the proposed fault tolerant data replica
placement algorithm for cloud based storage area networks. The penultimate section presents the experimental
study. The conclusion and future work are described in the last section.

II. Proposed Fault Tolerant Data Replica Placement Algorithm for Cloud Based Storage Area Network

Here we present a fault tolerant data placement mechanism for a cloud based Storage Area Network (SAN).
To provide fault tolerant service provisioning, the same data are placed across different storage locations or
datacenters. This process is called replication. This work adopts a graph based data placement model to
address the lack of awareness of differences among locations and of relationships among multiple objects [23].
Consider a Bipartite graph L(K, H), where K represents the vertices and H represents the edges. The graph
supports request-pattern edges (hyperedges) that connect multiple vertices, whereas an ordinary edge connects
at most two vertices. The vertex set consists of all data objects and all datacenters and is represented as

$$K = I \cup J, \tag{1}$$

where I is the set of data objects and J is the set of datacenters.

The edge set H contains one edge for each request pattern and one edge for each pair of data object and
datacenter, and is defined as

$$H = \{h_a \mid a \in \mathcal{A}\} \cup \{h_{ij} \mid i \in I,\ j \in J\}, \tag{2}$$

where $\mathcal{A}$ is the set of request patterns.
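
The vertex and edge sets of Eqs. (1) and (2) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name `build_graph`, the tuple tags `"item"` and `"dc"`, and the toy inputs are hypothetical and do not come from the paper.

```python
from itertools import product

def build_graph(items, datacenters, patterns):
    """Return (vertices, pattern_edges, pair_edges) for the placement graph."""
    # Eq. (1): the vertex set is the union of data objects and datacenters.
    vertices = {("item", i) for i in items} | {("dc", j) for j in datacenters}
    # First part of Eq. (2): one edge per request pattern, joining its objects.
    pattern_edges = [frozenset(("item", i) for i in a) for a in patterns]
    # Second part of Eq. (2): one edge per (data object, datacenter) pair.
    pair_edges = [(("item", i), ("dc", j)) for i, j in product(items, datacenters)]
    return vertices, pattern_edges, pair_edges

# Toy instance (assumed): three data objects, two datacenters, two request patterns.
vertices, pattern_edges, pair_edges = build_graph(
    items=["i1", "i2", "i3"],
    datacenters=["j1", "j2"],
    patterns=[("i1", "i2"), ("i2", "i3")],
)
```

A request-pattern edge may join more than two data objects, while every pair edge joins exactly one data object and one datacenter.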

Because this work adopts a Bipartite graph, multiple data objects can be attached to every request pattern
edge. Each edge $h \in H$ is given a weight that reflects the QoS requirement of data placement, in order to
minimize the latency of data access by the end client. Since this work considers a multi-objective function
[23], we set the weight of every edge in the graph to the multi-objective function, which is shown as follows

s = m. (e M ,<2 [<?] ,Q [s] ) c/ (3) 

where M is the weight vector of the multi-objective optimization metrics. More detail on the Bipartite graph
based data placement objective function can be found in [23]. In this work, both a data object and its
replicas are treated as replicas. Data placement is more challenging when replication of data objects is
allowed. The cost of replication depends on the number of replicas and on the locations where the replicas of
a data object are placed. In this work, we consider z replicas for each data object. Since we need to
determine the replica locations, the data placement mapping operation is optimized to

$$\mathcal{Y}: I \rightarrow J^{z}. \tag{4}$$
We further need to address the data transmission problem, since a request for data object i can be
satisfied at any location possessing a replica of i. We therefore need to determine the data transmission
mechanism as a mapping operation

$$\mathcal{X}: (i, a, j) \rightarrow j_{z}, \tag{5}$$

which gives the data transmission target in J for each object i in a pattern a requested from datacenter j.
An important point is that a solution with replication must include both the placement mapping (Y) and the
transmission mapping (X). Data transmission is decided for a given replica placement; once the data
transmission solution is obtained, the previously obtained placement may no longer be optimal. This makes data
transmission with replication more challenging.

To address the data placement problem under replication, we present an optimization scheme for efficient
data placement in a cloud based Storage Area Network, which is composed of three stages. In the first stage, a
simple greedy method yields a preliminary replica placement. In the second stage, a data transmission solution
is computed for each request pattern from each datacenter, taking the presence of replicas into account. The
request rate attached to each request pattern a is then optimized for an explicit set of replicas. In stage
three, based on the optimized request rates toward replicas, the replica placement solution is computed in the
space of replicas. The optimized data placement algorithm with replication is shown in Algorithm 1.


Algorithm 1: Data replica placement on cloud based storage area network

Step 1:  Y <- Stage 1(W)        (preliminary data replica placement)
Step 2:  S <- 0
Step 3:  repeat
Step 4:      X <- Stage 2(Y)    (data transmission solution)
Step 5:      Q = SlouQ(X, W)    (acquire task to replicas)
Step 6:      W = {Q, A}         (inputs in the replica space)
Step 7:      Y <- Stage 3(W)    (Bipartite graph partitioning)
Step 8:      S_end <- S
Step 9:      S <- S(X, Y)
Step 10: until S - S_end < τ
Step 11: Get X, Y
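
The control flow of Algorithm 1 can be sketched as follows. This is a minimal sketch, not the authors' implementation: the three stages, the replica-space conversion and the cost function are passed in as hypothetical callables, the initial cost is set to infinity instead of 0, and the stopping rule is assumed to compare the absolute change in cost with the threshold.

```python
def optimize_placement(workload, stage1, stage2, stage3, to_replica_space, cost, tau):
    """Alternate Stage 2 and Stage 3 until the cost change drops below tau."""
    placement = stage1(workload)                     # Step 1: preliminary placement
    current = float("inf")                           # cost of the latest solution (assumed start)
    while True:                                      # Steps 3-10: repeat ... until
        routing = stage2(placement)                  # Step 4: data transmission solution
        replica_workload = to_replica_space(routing, workload)  # Steps 5-6
        placement = stage3(replica_workload)         # Step 7: graph partitioning
        previous, current = current, cost(routing, placement)   # Steps 8-9
        if abs(previous - current) < tau:            # Step 10: assumed stopping rule
            return routing, placement                # Step 11

# Toy run with stand-in stages and a scripted, decreasing cost sequence.
costs = iter([10.0, 6.0, 5.0, 4.95])
routing, placement = optimize_placement(
    workload={"q": 1},
    stage1=lambda w: "p0",
    stage2=lambda p: "x(" + p + ")",
    stage3=lambda w: "p1",
    to_replica_space=lambda x, w: w,
    cost=lambda x, p: next(costs),
    tau=0.1,
)
```

In the toy run the loop stops on the fourth iteration, when the cost changes by less than 0.1.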



In stage 3, we compute the replica placement solution in the space of replicas based on the optimized
request rates toward replicas. Stages 2 and 3 are applied iteratively until the improvement is smaller than a
threshold parameter. The architecture of the proposed BGDRP, with the detail of each stage, is given in Fig. 1.

Fig. 1. Architecture of proposed BGDRP and transmission technique


a) Preliminary placement: 

To generate the preliminary replica placement we use a greedy method, shown as stage 1 in Fig. 1. For each
data item $i \in I$, we acquire the set $M_i = \{Q_{ij} \mid j \in J\}$, which holds the request rate of item i
from the different storage locations, and sort it in descending order. We consider z replicas for item i, and
the z datacenters with the highest rates in $M_i$ are selected to store the replicas of item i. This
preliminary placement helps to keep the resulting cumulative communication load/traffic low. It is better than
the state-of-art arbitrary preliminary placement algorithm. Although the remaining performance parameters are
not considered at this stage, this does not affect the final performance, because all optimization parameters
are used in the later stages.
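
The greedy rule of stage 1 can be sketched in a few lines. The function name `preliminary_placement`, the nested-dictionary layout of the request rates and the toy numbers are assumptions made for illustration; ties are broken by datacenter name, which the paper does not specify.

```python
def preliminary_placement(request_rates, z):
    """Pick, for each data item, the z datacenters with the highest request rate.

    request_rates[i][j] is the request rate of item i from datacenter j.
    """
    placement = {}
    for item, rates in request_rates.items():
        # Sort datacenters by request rate in descending order (ties by name).
        ranked = sorted(rates, key=lambda dc: (-rates[dc], dc))
        placement[item] = ranked[:z]
    return placement

# Toy request rates (assumed values) for two items and three datacenters.
rates = {
    "i1": {"j1": 5.0, "j2": 9.0, "j3": 1.0},
    "i2": {"j1": 2.0, "j2": 2.5, "j3": 7.0},
}
placement = preliminary_placement(rates, z=2)
```

With z = 2, item i1 is placed at j2 and j1, and item i2 at j3 and j2.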


b) Data transmission solution: 

The major issue when allowing replicas in cloud based storage area network management is to find the ideal
data transmission model for the present state of the replica placement, which is shown as stage 2 in Fig. 1.
For a request pattern a issued at source datacenter j, we can optimize the choice of replicas used to satisfy
all the objects requested in pattern a. We express this as the following binary optimization problem

$$
\begin{aligned}
\min \quad & \sum_{j \in J} S_{a}\, y_{j} + \sum_{i \in a} \sum_{j \in J} S_{ij}\, x_{ij} \\
\text{such that} \quad & \sum_{j \in J} x_{ij}\, \mathbb{1}(i \in U_j) = 1, \quad \forall i \in a \\
& x_{ij} \le y_{j}, \quad \forall i \in a,\ \forall j \in J \\
& x_{ij} \in \{0, 1\}, \quad \forall i \in a,\ \forall j \in J \\
& y_{j} \in \{0, 1\}, \quad \forall j \in J
\end{aligned}
\tag{6}
$$

where $S_a$ and $S_{ij}$ are constants obtained from the multi-objective weight of Eq. (3) under the present
placement, and $U_j$ is the set of objects with a replica stored at datacenter j. The ideal strategy of Eq. (6)
guarantees the minimized value of Eq. (3) under any obtained replica placement. The binary parameter $x_{ij}$
denotes whether an object $i \in a$ will be transmitted to datacenter j, and the binary parameter $y_j$
represents whether datacenter j is active, that is, utilized in the transmission of a. The constraints
guarantee that each object $i \in a$ is transmitted to exactly one datacenter that holds a replica of i and is
being utilized. The objective of our model is to minimize the cost induced by satisfying request a from the
datacenters. The first part involves the number of datacenters, and the second part involves the
inter-datacenter load and latencies in satisfying a. This aids in achieving the objective of Eq. (3).
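
For a very small instance, the two-part objective described for Eq. (6) can be explored by exhaustive search. The sketch below is only an illustration, not the method of the paper: `holders`, `activation_cost` and `transfer_cost` are hypothetical stand-ins for the replica placement and for the two constants of the objective, and brute force is used instead of an optimization solver.

```python
from itertools import product

def best_transmission(pattern, holders, activation_cost, transfer_cost):
    """Exhaustively minimise activation plus transfer cost for one request pattern.

    holders[i] lists the datacenters that store a replica of item i.
    """
    best_cost, best_assignment = float("inf"), None
    # Every item must be served by exactly one datacenter holding its replica.
    for choice in product(*(holders[i] for i in pattern)):
        active = set(choice)                                    # datacenters in use
        cost = sum(activation_cost[j] for j in active)          # first part
        cost += sum(transfer_cost[i, j] for i, j in zip(pattern, choice))  # second part
        if cost < best_cost:
            best_cost, best_assignment = cost, dict(zip(pattern, choice))
    return best_cost, best_assignment

# Toy instance with assumed costs.
pattern = ["i1", "i2"]
holders = {"i1": ["j1", "j2"], "i2": ["j2", "j3"]}
activation = {"j1": 4.0, "j2": 4.0, "j3": 4.0}
transfer = {("i1", "j1"): 1.0, ("i1", "j2"): 2.0, ("i2", "j2"): 2.0, ("i2", "j3"): 1.0}
cost, assignment = best_transmission(pattern, holders, activation, transfer)
```

In this toy instance, serving both items from j2 is cheapest, because activating a single datacenter outweighs the slightly higher per-item transfer cost.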

The first part of the objective leads to the set-cover problem, which is NP-complete. As a result, this
work also considers the second part, which is fairly simple: for each object i we can just select the data
center storage j that minimizes it. The set-cover problem is addressed through linear programming relaxation,
where we relax all the parameters to real numbers in the range of zero to one. Each relaxed parameter can be
considered as the likelihood that the corresponding parameter will be set to one in the final solution. In our
work, we retrieve the solution parameters in the form of likelihoods from the relaxation, and the linear
programming problem can be solved in polynomial time. Then, for each data object $i \in a$, we select its
serving data center storage by $\arg\max_{j \in J} x_{ij}$, which can be considered as selecting the data
center storage that has the maximal likelihood of serving i. The state-of-art set-cover approach uses only
these relaxed parameters for obtaining the final solution, whereas our model further considers the second part
of the objective function.
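
The rounding step described above, picking for every object the datacenter with the largest relaxed value, can be sketched as follows. The fractional values are assumed to come from some linear programming solver that is not shown; the names `round_relaxed_solution`, `fractional` and `holders` are hypothetical.

```python
def round_relaxed_solution(fractional, holders):
    """Pick, for each item, the replica-holding datacenter with the largest value."""
    serving = {}
    for item, scores in fractional.items():
        # Only datacenters that actually hold a replica of the item are eligible.
        candidates = [dc for dc in holders[item] if dc in scores]
        serving[item] = max(candidates, key=lambda dc: scores[dc])
    return serving

# Assumed output of an LP relaxation: one likelihood per (item, datacenter).
fractional = {"i1": {"j1": 0.3, "j2": 0.7}, "i2": {"j2": 0.6, "j3": 0.4}}
holders = {"i1": ["j1", "j2"], "i2": ["j2", "j3"]}
serving = round_relaxed_solution(fractional, holders)
```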

c) Replica placement solution: 

The replica placement solution is obtained by extending the strategy used for the case without replicas.
Here a replica is denoted by r and the set of replicas by T. After stage 2 is complete and the data
transmission solution X is obtained, we can express the request rate to each replica. We then convert the
workload set from $Q = \{Q_{aj}\}$ to $\hat{Q} = \{Q_{\hat{a}j}\}$, which is the operation shown in Step 5 of
Algorithm 1. The difference between a and $\hat{a}$ is that $\hat{a}$ is expressed in the replica space.
Formally, $\hat{a}$ is a binary vector with one entry per replica. In particular, a only specifies whether a
data object i is in the request pattern, whereas $\hat{a}$ shows whether a particular replica of each object
$i \in a$ is actually involved in satisfying the request.

Then in stage 3, with the workload expressed in the replica space, we make the data replica placement
decision by extending the Bipartite graph construction. The vertices of the Bipartite graph become the union
of the datacenter set and the replica set. In the edge set, the data-datacenter edges are replaced by
replica-datacenter edges. The weights of the edges are established as follows

{ M. (nt^\ o) > f or each hyperedge A#, ^ 

'LL 

M. (mf/, 'inf-, mfj') + 8, for each edge A n 

Using Eq. (7), we can apply Bipartite partitioning strategy as similar to methodology without replicas. The 
computation complexity of the Bipartite partitioning strategy is $(( \3C\ + \3~C\) logT), so the computation 
complexity of our model is not higher than d((|c/Z|7’ + zKT) logT). 

We now simplify Eq. (7) by fixing the weights of all replica-datacenter edges. For each replica, we only
consider the replica-datacenter edge with the maximal weight among its edges to all datacenters $j \in J$.
This gives higher priority to not cutting the maximum-weight edge in the datacenter edge set associated with
each replica. This approximation helps to reduce the number of edges and the computation time, as shown
experimentally in a later section of the paper.
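
The edge reduction described above can be sketched as a single pass over the replica-datacenter edges. The dictionary layout keyed by (replica, datacenter) and the toy weights are assumptions made for illustration only.

```python
def keep_heaviest_edge(edge_weights):
    """For each replica keep only its maximum-weight replica-datacenter edge."""
    heaviest = {}
    for (replica, dc), weight in edge_weights.items():
        if replica not in heaviest or weight > heaviest[replica][1]:
            heaviest[replica] = (dc, weight)
    return {(r, dc): w for r, (dc, w) in heaviest.items()}

# Toy weights (assumed): two replicas, each with two candidate edges.
weights = {("r1", "j1"): 0.2, ("r1", "j2"): 0.9, ("r2", "j1"): 0.5, ("r2", "j3"): 0.4}
pruned = keep_heaviest_edge(weights)
```

The pruned graph has one replica-datacenter edge per replica, so the number of such edges drops from the number of replica-datacenter pairs to the number of replicas.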

The replica placement solution is obtained by applying Bipartite graph partitioning [23], and it is in turn
the input of the data transmission solution strategy in the next step. After each iteration of the data
transmission solution and the placement solution, we stop iterating once the improvement is less than the
threshold τ. Lastly, the data transmission and placement solutions from the previous iteration are sent to the
datacenters in the cloud based storage area network. With the deterministic data transmission solution X, we
can derive a hash mapping operation for each datacenter storage, whose input is a request pattern and whose
output is the data transmission target of each object in the pattern. Such an operation ensures that any
request can be processed with minimal time/latency, which is a key factor for cloud based storage area
networks. The next section presents the performance evaluation of the proposed BGDRP and transmission
technique over the existing system.
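
One possible shape of the hash mapping described above is sketched below: a lookup table keyed by source datacenter and request pattern that returns the transmission target of each object. The key layout, the function name `build_routing_table` and the toy routing entries are hypothetical; the paper does not specify the data structure.

```python
def build_routing_table(routing):
    """Index the transmission solution by (source datacenter, request pattern)."""
    table = {}
    for (item, pattern, source), target in routing.items():
        # frozenset makes the pattern hashable and order-independent.
        table.setdefault((source, frozenset(pattern)), {})[item] = target
    return table

# Toy transmission solution (assumed): (item, pattern, source) -> target datacenter.
routing = {
    ("i1", ("i1", "i2"), "j1"): "j2",
    ("i2", ("i1", "i2"), "j1"): "j2",
    ("i2", ("i2", "i3"), "j1"): "j3",
    ("i3", ("i2", "i3"), "j1"): "j3",
}
table = build_routing_table(routing)
targets = table[("j1", frozenset(("i2", "i1")))]
```

Looking up a request is then a single dictionary access, independent of the number of stored patterns on average.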


III. Simulation Result and Analysis

This section presents the performance evaluation of the proposed BGDRP over the existing methodology in
terms of latency, computation overhead time and computing cost. The experiments are conducted on the Windows
10 Enterprise edition operating system, with an Intel i5 quad core processor, 16 GB RAM and a 4 GB dedicated
CUDA enabled GPU. This work considers real-time scientific and data intensive workflow applications such as
Inspiral and Montage. The workflows are obtained from [24]. The proposed and existing methodologies are
implemented in Java 8 using the Eclipse Neon IDE. The performance of the proposed BGDRP technique is evaluated
in terms of workflow latency, computation overhead time and computing cost, and is compared with the existing
model [18].

a) Data replica placement latency performance considering different real-time workflows:

Experiments are conducted to study the performance achieved by BGDRP over the existing approach [18] in
terms of the latency of task execution. We consider two real-time workflows, Inspiral_1000 and Montage_1000.
The number of datacenters is varied from 20 to 80, each datacenter is composed of 10 nodes, and the data
replication size is set to 5. The number of users is fixed at 500. The experimental study shows that the
proposed BGDRP performs better than the existing approach in terms of latency. For the Inspiral_1000 workflow,
BGDRP achieves a latency reduction of 7.57%, 10.86%, 11.6% and 11.96% over the existing approach when the
number of datacenters is 20, 40, 60 and 80 respectively, as shown in Fig. 2. An average latency reduction of
10.5% is achieved by BGDRP over the existing approach for the Inspiral workflow. Similarly, for the
Montage_1000 workflow, BGDRP achieves a latency reduction of 13.8%, 17.00%, 19.28% and 20.11% when the number
of datacenters is 20, 40, 60 and 80 respectively. An average latency reduction of 14.02% is achieved by BGDRP
over the existing approach for the Montage workflow, as shown in Fig. 3. An overall latency reduction of 12.56%
is achieved by BGDRP over the existing approach across the different case studies.

Fig. 2. Latency performance considering Inspiral_1000 workflow

Fig. 3. Latency performance considering Montage_1000 workflow


b) Data replica placement computation time performance considering different real-time workflows:

Experiments are conducted to study the performance achieved by BGDRP over the existing approach [18] in
terms of the computation time of task execution. We consider two real-time workflows, Inspiral_1000 and
Montage_1000. The number of datacenters is varied from 20 to 80, each datacenter is composed of 10 nodes, and
the data replication size is set to 5. The number of users is fixed at 500. The experimental study shows that
the proposed BGDRP performs better than the existing approach in terms of computation time. For the
Inspiral_1000 workflow, BGDRP achieves a computation performance improvement of 70.12%, 89.41%, 90.11% and
90.54% over the existing approach when the number of datacenters is 20, 40, 60 and 80 respectively, as shown
in Fig. 4. An average improvement of 85.044% is achieved by BGDRP over the existing approach for the Inspiral
workflow. Similarly, for the Montage_1000 workflow, BGDRP achieves a computation performance improvement of
82.11%, 93.63%, 94.22% and 94.48% when the number of datacenters is 20, 40, 60 and 80 respectively. An average
improvement of 91.11% is achieved by BGDRP over the existing approach for the Montage workflow, as shown in
Fig. 5. An overall computation performance improvement of 87.5% is achieved by BGDRP over the existing
approach across the different case studies.
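
The per-workflow averages quoted in this subsection are arithmetic means over the four datacenter counts. A minimal check, using only the percentages reported above (the variable names are illustrative):

```python
from statistics import mean

# Reported computation time improvement (%) for 20, 40, 60 and 80 datacenters.
inspiral = [70.12, 89.41, 90.11, 90.54]
montage = [82.11, 93.63, 94.22, 94.48]

inspiral_avg = mean(inspiral)   # about 85.04
montage_avg = mean(montage)     # about 91.11
```
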
Fig. 4. Task execution time considering Inspiral_1000 dataset

Fig. 5. Task execution time considering Montage_1000 dataset

c) Data replica placement computing cost performance considering different real-time workflows:

Experiments are conducted to study the performance achieved by BGDRP over the existing approach [18] in
terms of the computing cost of task execution. We consider two real-time workflows, Inspiral_1000 and
Montage_1000. The number of datacenters is varied from 20 to 80, each datacenter is composed of 10 nodes, and
the data replication size is set to 5. The number of users is fixed at 500. The experimental study shows that
the proposed BGDRP performs better than the existing approach in terms of computing cost. For the
Inspiral_1000 workflow, BGDRP achieves a computing cost reduction of 27.37%, 29.96%, 30.54% and 30.83% over
the existing approach when the number of datacenters is 20, 40, 60 and 80 respectively, as shown in Fig. 6. An
average computing cost reduction of 29.67% is achieved by BGDRP over the existing approach for the Inspiral
workflow. Similarly, for the Montage_1000 workflow, BGDRP achieves a computing cost reduction of 32.26%,
34.79%, 36.58% and 37.23% when the number of datacenters is 20, 40, 60 and 80 respectively. An average
computing cost reduction of 35.21% is achieved by BGDRP over the existing approach for the Montage workflow,
as shown in Fig. 7. An overall computing cost reduction of 32.6% is achieved by BGDRP over the existing
approach across the different case studies.


Fig. 6. Task execution computing cost considering Inspiral_1000 dataset

Fig. 7. Task execution computing cost considering Montage_1000 dataset

IV. Conclusion

Developing an efficient storage and transmission mechanism for scientific and data intensive applications
is challenging, since it requires low latency, cost and computation overhead. Cloud based Storage Area
Networks have attained wide popularity in recent times due to their ease of use and fault tolerance
guarantees. Minimizing cost with a performance guarantee on such platforms is highly desirable, and providing
fault tolerant and continuous access to data with minimal latency and cost is challenging. To provide fault
tolerant data access and transmission, this paper presented a Bipartite Graph based Data Replica Placement
technique. BGDRP aids in minimizing latency and computing cost. Our model is better than random or genetic
algorithm based data replication placement. Experiments are conducted to evaluate the performance of BGDRP
over the existing approach using real-time workflows, considering varied node/datacenter sizes with a fixed
number of users and a fixed data replication size. The outcome shows that BGDRP achieves an average
performance improvement of 12.568%, 87.5% and 32.6% over the existing model in terms of latency, computation
time and cost respectively, and thus reduces data access latency, computation time and cost compared with the
state-of-art technique. The study shows the efficiency, scalability and robustness of our model. Future work
will consider minimizing energy, as it is directly proportional to cost, and will aid in utilizing resources
efficiently.
Abstract: In this paper, person identification is performed based on
sets of facial images. Each facial image is treated as a scattered
point of logistic regression. The vertical distance between the
scattered point of a facial image and the regression line is used as
the parameter to determine whether the image is of the same person or
not. The ratio of the Euclidean distances (in terms of the number of
pixels of the gray scale image, measured with ‘imtool’ of Matlab 13.0)
between nasal and eye points is determined. The variance of this ratio
is used as another parameter to identify a facial image. The concept
is combined with the ghost image of Principal Component Analysis,
where the mean square error and the signal to noise ratio (SNR) in dB
are used as the detection parameters. The combination of the three
methods enhances the degree of accuracy compared to each individual
method.

Keywords: LDA; normalized error; eigen value; SNR and
covariance matrix.

I. Introduction 

Pattern recognition is a vast field of image
processing/computer vision concerned with the recognition of
objects. For example, the face recognition approaches of [1, 17, 18],
known as ‘biometric techniques’, recognize a person using image
features as in pattern recognition. In Linear Discriminant Analysis
(LDA), the variance among ‘image classes’ is used as the parameter to
identify an image, as found in [14]. Another statistical model is
linear regression-based classification (LRC), which is simple but
efficient and is discussed in [15]; a similar concept is also found in
[16], where the method is designated the ‘locally linear regression
(LLR) method’. There the authors generate a virtual frontal view from
a given non-frontal face image, so the method becomes pose-invariant,
which matters because recent face recognition techniques experience
difficulties with poses. An efficient method for face recognition is
based on Principal Component Analysis (PCA), explained in
[2, 3, 4, 19]. In PCA based face detection, a few training images of
the same dimension are converted to vectors. The average vector and
the difference vectors are then evaluated. Next, the eigenvalues and
eigenvectors are evaluated from the difference vectors as explained in
[5]. Converting the eigenvectors into image matrices provides the
eigenfaces. Next, a weighting factor and the corresponding weighting
vector are computed. Finally, the Euclidean distance between the
weighting vectors of the training images and the test class provides
the identity of the face. Instead of a group of pictures and their
eigenfaces for training, a single image face recognition approach
using the Discrete Cosine Transform (DCT) is proposed in [6], where
both the local and global features of the face are extracted using DCT
and zigzag scanning of the DCT coefficients. There, images of size
128x128 are taken from a database, and out of 16384 coefficients only
64 are used as the feature of the image. A similar analysis is found
in [7], where additionally the coordinates of the eyes are entered
manually to normalize the face. The accuracy of the system is found to
be about 95%, depending on the number of DCT coefficients.

In image processing, the two-dimensional wavelet transform is
widely used with a threshold that preserves the most
energetic coefficients, for de-noising, compression, and
even identification of an image. For this purpose, the
two-dimensional filter bank of [8] or the wavelet packet transform of
[9-11] is widely used. The most widely used technique in pattern
recognition is to select an image of size $2^n \times 2^n$, where n is a
positive integer. The discrete wavelet transform (DWT) is applied to the
first column vector, and the approximation coefficient vector,
which is half the size of the original column vector, is evaluated.
The approximation component is obtained simply from
the convolution of the column vector with the impulse response of a
low-pass filter, followed by downsampling by two. The DWT is applied
to the column vector recursively until a single point remains. If this
operation is applied to each column vector, we get a row vector
for the image. Finally, the DWT is applied to the row vector
recursively until a vector of length 8 remains, which contains the
8 minimum spectral points of the image. The resulting vector
can be considered a feature of the image, as discussed in
[12-13].
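The recursive approximation step can be sketched as below. This is only an illustration under assumptions not stated in the text: a Haar low-pass filter is used, the image side is a power of two, and the names `haar_approx`, `reduce_to` and `dwt_feature` are hypothetical.

```python
import numpy as np

def haar_approx(v: np.ndarray) -> np.ndarray:
    """One DWT level: Haar low-pass filter, then downsample by two."""
    return (v[0::2] + v[1::2]) / np.sqrt(2.0)

def reduce_to(v: np.ndarray, length: int) -> np.ndarray:
    """Apply the approximation step until the vector has `length` samples."""
    while v.size > length:
        v = haar_approx(v)
    return v

def dwt_feature(image: np.ndarray, length: int = 8) -> np.ndarray:
    # Each column collapses to one point, giving a row vector for the image.
    row = np.array([reduce_to(col, 1)[0] for col in image.T.astype(float)])
    # The row vector is then reduced to the final feature length.
    return reduce_to(row, length)
```

Any other wavelet could replace the Haar filter; only `haar_approx` would change.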

The paper is organized as follows: Section 2 provides the PCA
algorithm along with its modified version, Section 3 presents
the results based on the analysis of Section 2 with examples, and
Section 4 concludes the entire analysis.

II. Methodology


Sometimes it is necessary to relate the data of a dependent and an
independent variable by a linear, exponential, or higher-order
polynomial relation in order to estimate unknown points. Extending the
graph of such a relation is used to predict expected future data. In
linear regression, the relation between x and y is

$y = a + bx$ (1)

Here a and b are constants determined from the n scattered data
points $(x_i, y_i)$ using the least-squares relations

$a = \bar{y} - b\bar{x}$

and

$b = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$

In this paper each image $x_i$ (along the y-axis) and its index i
(along the x-axis) are used as the scattered points. The following
algorithm is used to determine the regression coefficients.
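The coefficients of the line y = a + bx in equation (1) can be estimated with ordinary least squares. The sketch below uses the standard closed-form expressions on scalar data; the function name `fit_line` is hypothetical, and how the paper extends this to image matrices is not reproduced here.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    a = sy / n - b * sx / n                         # intercept
    return a, b
```

Points that lie exactly on a line, such as (1, 3), (2, 5), (3, 7), (4, 9), recover that line's intercept and slope.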


Algorithm 1: 


Step 1: Read ith facial image from a database of facial images. 
Step 2: Select ROI using Viola-Jones Algorithm 
Step 3: Normalize the image (ROI) and assign it as x m - 


Step 4: Evaluate the matrix, y. = In 


'X.,' 

T 

K X ni J 




correspond to 
Step 5: Repeat steps 1 to 4 for i = 1, 2, 3, ..., N.
Step 7: Select a test image from the same or a different database
and denote it as $x_t$.

Step 8: Apply steps 2 and 3 to $x_t$.

Step 9: Determine the point on the regression line, $y_t = a + b x_t$.

Step 10: Determine the error, $e_i = \left|\, |y_i| - |y_t| \,\right|$, where i = 1, 2, 3, ..., N.

Step 11: Here $e_i$ is a matrix of size M×N; hence determine the
normalized error, $\varepsilon = \dfrac{1}{255 \cdot M \cdot N} \sum_{i=1}^{M} \sum_{j=1}^{N} e(i, j)$.

Step 12: Select a threshold value of the error, $\tau$.
If $\varepsilon < \tau$, the test image is of the same person; otherwise it
is of a different person.
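Steps 10 to 12 can be sketched as follows. This is an illustration under assumptions: images are 8-bit grayscale arrays of equal shape, the default threshold of 0.10 mirrors the 10% threshold quoted in the results section, and the function names are hypothetical.

```python
import numpy as np

def normalized_error(y_ref: np.ndarray, y_test: np.ndarray) -> float:
    """Mean absolute difference of magnitudes, scaled by the 8-bit range."""
    e = np.abs(np.abs(y_ref.astype(float)) - np.abs(y_test.astype(float)))
    m, n = e.shape
    return float(e.sum() / (255.0 * m * n))

def same_person(y_ref: np.ndarray, y_test: np.ndarray, tau: float = 0.10) -> bool:
    # tau is the error threshold of step 12 (assumed value).
    return normalized_error(y_ref, y_test) < tau
```

Dividing by 255·M·N keeps the error between 0 and 1, so the threshold can be read as a percentage.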


The second part of the paper deals with the detection of the nasal
and eye points in order to measure the distances between them.
The corresponding algorithm is given below.

Algorithm 2: 

Step 1: Read a facial image from a database of facial images
and take the ROI using the Viola-Jones algorithm.

Step 2: Normalize the image (ROI) to N×M and designate it as y.

Step 3: Show the image y and hold it for the following pseudo
code.

Step 4: Initialize the variables a = 0, P = 20, Q = 25, L = 20, H = 25;

for t = 1 : N
    for s = 1 : M
        k = 0;
        ap = 0;
        % collect the pixels of the current window (rows P..Q, columns L..H)
        for i = P : Q
            for j = L : H
                k = k + 1;
                a(k) = y(i, j);
            end
        end

        ap = mean(a);
        % mark the window centre when the window is dark on average
        if ap <= 40
            plot(uint8(mean(L : H)), uint8(mean(P : Q)), '*')
        end

        hold on
        P = P + 5; Q = Q + 5;
    end  % end of s loop

    P = 20; Q = 25;
    L = L + 5; H = H + 5;
end  % end of t loop

Step 5: Apply ‘imtool’ on the resultant image to measure the 
distance between nasal and eye points. 
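The window scan of Algorithm 2 can be sketched in Python as below. It is a hedged re-expression, not the authors' code: indices are 0-based (the pseudo code is 1-based), the scan is clipped to the image bounds, and the name `dark_window_centers` and its parameters are hypothetical.

```python
import numpy as np

def dark_window_centers(y: np.ndarray, size: int = 6, step: int = 5,
                        start: int = 20, threshold: float = 40.0):
    """Return (row, col) centers of windows whose mean intensity <= threshold."""
    centers = []
    rows, cols = y.shape
    for c0 in range(start, cols - size + 1, step):       # outer loop: columns
        for r0 in range(start, rows - size + 1, step):   # inner loop: rows
            window = y[r0:r0 + size, c0:c0 + size]
            if window.mean() <= threshold:
                centers.append((r0 + (size - 1) / 2, c0 + (size - 1) / 2))
    return centers
```

The returned centers play the role of the plotted markers; the distances between them can then be measured directly instead of with an interactive tool.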

Finally, we apply a modified PCA algorithm (analogous to regression
on the image vectors) to the facial images to determine the ghost
images corresponding to the six largest eigenvalues. The separation
between the images and the ghost images gives the error. The
conventional PCA algorithm and its modification are given below.

Algorithm 3: Conventional PCA algorithm 


Step 1: Select the training images $I_1, I_2, I_3, \ldots, I_M$, each of
size N×N, and convert them to column vectors
$\Gamma_1, \Gamma_2, \Gamma_3, \ldots, \Gamma_M$, each of size $N^2 \times 1$.

Step 2: Determine the average vector, $\Psi = \dfrac{1}{M}\sum_{i=1}^{M} \Gamma_i$
(known as the average face), and the difference vectors,
$\Phi_i = \Gamma_i - \Psi$; i = 1, 2, 3, ..., M.

Step 3: Create the matrix $A = [\Phi_1 \; \Phi_2 \; \cdots \; \Phi_M]$
and the covariance matrix $C = AA^{T}$. The covariance matrix
can also be evaluated as $C = \dfrac{1}{M}\sum_{i=1}^{M} \Phi_i \Phi_i^{T}$. The size of
matrix C is $N^2 \times N^2$.

Step 4: The M orthogonal eigenvectors $u_k$ (where
$u_l^{T} u_k = \delta_{lk}$) and the corresponding eigenvalues $\lambda_k$ selected
from the covariance matrix C indicate the principal
components of the data.

Step 5: Select a new test image and determine its
vector $\Gamma$. The projection of $\Gamma$ on the face space is
$\omega_k = u_k^{T}(\Gamma - \Psi)$, where $\omega_k$ is called the weight of face k.
Define the weight vector $\Omega = [\omega_1 \; \omega_2 \; \cdots \; \omega_M]$.

Step 6: The Euclidean distance $\varepsilon_k = \|\Omega - \Omega_k\|$ is measured,
and if the distance is greater than a threshold value $\theta$ the
test image is unknown; otherwise it belongs to the class of the same
database. Here $\Omega_k$ is the weight vector of image k.
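A compact sketch of the conventional PCA steps is given below. It is illustrative only: it uses the common shortcut of decomposing the small M×M matrix $A^{T}A$ instead of the $N^2 \times N^2$ covariance matrix (a standard substitution, not stated in the text), and the function names are hypothetical. `n_components` should be smaller than the number of training images so that no zero-eigenvalue direction is normalized.

```python
import numpy as np

def train_eigenfaces(images: np.ndarray, n_components: int):
    """images: array of shape (M, N, N). Returns mean face, eigenfaces, weights."""
    gamma = images.reshape(len(images), -1).astype(float).T   # N^2 x M
    psi = gamma.mean(axis=1, keepdims=True)                   # average face
    a = gamma - psi                                           # difference vectors
    # A^T A (M x M) shares its non-zero eigenvalues with A A^T (N^2 x N^2).
    vals, vecs = np.linalg.eigh(a.T @ a)
    order = np.argsort(vals)[::-1][:n_components]             # largest first
    u = a @ vecs[:, order]
    u /= np.linalg.norm(u, axis=0)                            # unit eigenfaces
    weights = u.T @ a                                         # one column per image
    return psi, u, weights

def nearest_distance(image: np.ndarray, psi, u, weights) -> float:
    """Smallest Euclidean distance between the test weights and training weights."""
    omega = u.T @ (image.reshape(-1, 1).astype(float) - psi)
    return float(np.linalg.norm(weights - omega, axis=0).min())
```

Comparing the returned distance with a threshold gives the known/unknown decision of step 6.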


Algorithm 3.1: Modification of PCA algorithm 

The above algorithm is modified, and the corresponding steps are
given below.

Step 1: $\omega_k = u_k^{T}(\Gamma_i - \Psi)$; k = 1, 2, 3, ..., M, for each value of i.

Step 2: Construct the vector $\Omega_i = [\omega_1 \; \omega_2 \; \cdots \; \omega_M]$ for the
ith image.

Step 3: Determine another vector,
$\Theta_i = \min([\,|\Omega_i - \Omega_1| \;\; |\Omega_i - \Omega_2| \;\; \cdots \;\; |\Omega_i - \Omega_M|\,])$, where
the term $|\Omega_i - \Omega_j|$ for i = j does not exist.

Step 4: Next, take the minimum value of the vector of step 3,
$\Delta_s = \min([\Theta_1 \; \Theta_2 \; \cdots \; \Theta_M])$.

Step 5: Construct the vector $\Omega_{test} = [\omega_1 \; \omega_2 \; \cdots \; \omega_M]$ for the
test image and determine the parameter
$\Delta_{test} = \min([\,|\Omega_{test} - \Omega_1| \;\; \cdots \;\; |\Omega_{test} - \Omega_M|\,])$.

Step 6: If $\|\Delta_s - \Delta_{test}\| < \varepsilon$, the test image belongs to the
same group; otherwise it is not an image of the same group or
database.

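Step 7: Take the mean error and the SNR in dB,

$\text{Error} = \dfrac{1}{M}\sum_{i=1}^{M} |\Omega_{test} - \Omega_i|$

$\text{SNR} = 10 \log \left( \dfrac{\frac{1}{M}\sum_{i=1}^{M} |\Omega_i|}{\frac{1}{M}\sum_{i=1}^{M} |\Omega_{test} - \Omega_i|} \right)$
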

Two examples based on the conventional PCA algorithm and its
modification are shown here.


III. Result 

The vertical difference between each scatter point
(corresponding to a facial image) and the regression line is shown
in Table I for three cases: the images used in the regression, images
from the same database not used in the regression, and images of a
different person. Here we take the regression parameter K = 40,
50, 60, 70 and 80. The results are almost the same in all cases.
The error is smallest (about 7% or less) for the images used in the
regression, which verifies the convergence of the data.
Images of the same person that were not used in the regression give
an error below 10%, whereas images of a different person give an
error of about 11% or more. Taking a threshold error of
10%, we can therefore distinguish persons.

TABLE I.

Vertical distance between regression line and scattered points

Value     Same Database,   Same Database,     Different
          Same Image       Different Image    Database

K = 40    0.0556           0.0624             0.1693
          0.0543           0.0652             0.1573
          0.0738           0.0937             0.1612
          0.0523           0.0994             0.1434
          0.0536           0.0924             0.1497
          0.0680           0.0995             0.1441

K = 50    0.0505           0.0568             0.1532
          0.0493           0.0593             0.1426
          0.0671           0.0854             0.1461
          0.0476           0.0906             0.1301
          0.0486           0.0841             0.1356
          0.0617           0.0905             0.1307

K = 60    0.0470           0.0529             0.1422
          0.0458           0.0552             0.1325
          0.0624           0.0796             0.1358
          0.0443           0.0845             0.1209
          0.0453           0.0784             0.1260
          0.0574           0.0843             0.1215

K = 70    0.0445           0.0500             0.1341
          0.0432           0.0521             0.1250
          0.0590           0.0752             0.1281
          0.0418           0.0799             0.1141
          0.0427           0.0741             0.1189
          0.0541           0.0797             0.1147

K = 80    0.0424           0.0477             0.1277
          0.0412           0.0497             0.1191
          0.0563           0.0718             0.1221
          0.0399           0.0764             0.1088
          0.0408           0.0708             0.1133
          0.0516           0.0761             0.1094

The next part of the results uses the ratio of the distances between
the nasal points (left and right) and the eyes (left and right) as the
random variable. The ratios of the eye-to-eye and nose-to-eye
Euclidean distances are given in Table II for the same person and in
Table III for different persons. A few examples of the distance
calculation are shown in Figure 1. In Tables II and III, E-E denotes
the Euclidean distance from eye to eye, N-LE the distance from the
left nasal point to the left eye, and N-RE the distance from the right
nasal point to the right eye.



Figure 1. Euclidean distance between nasal and eye points


TABLE II.

Ratio of length of nasal point and eye for the images of same person

Same Image   E-E (R1)   N-LE (R2)   N-RE (R3)   R1/R2    R1/R3
01           214.70     150.51      135.42      1.4265   1.5854
02           217.01     150.42      140.06      1.4427   1.5494
03           226.00     157.27      145.96      1.4370   1.5484
04           228.00     159.62      149.25      1.4284   1.5276
05           232.01     165.60      150.63      1.4010   1.5403
06           229.00     161.72      154.49      1.4160   1.4823
07           203.04     139.62      134.27      1.4542   1.5122
08           199.09     140.66      136.82      1.4154   1.4551
09           205.79     143.27      139.08      1.4364   1.4797
10           203.00     138.81      136.48      1.4624   1.4874
11           190.79     142.80      116.14      1.3361   1.6428
12           195.21     150.71      119.87      1.2953   1.6285


TABLE III.

Ratio of length of nasal point and eye for the images of different person

Different Image   E-E (R1)   N-LE (R2)   N-RE (R3)   R1/R2    R1/R3
01                191.00     119.67      104.40      1.5961   1.8295
02                201.90     123.46      120.51      1.6353   1.6754
03                182.32     124.60      125.16      1.4632   1.4567
04                163.00     133.78      118.60      1.2184   1.3744
05                187.38     133.27      133.09      1.4060   1.4079
06                189.55     120.88      131.49      1.5681   1.4416
07                178.34     120.93      120.60      1.4747   1.4788
08                172.65     127.91      109.08      1.3498   1.5828
09                186.45     151.70      144.90      1.2291   1.2867
10                199.81     149.05      150.24      1.3406   1.3299
11                167.15     96.33       96.61       1.7352   1.7302
12                196.09     135.15      131.23      1.4509   1.4942


For the same person, the variances of the distance ratios R1/R2 and
R1/R3 are found to be 0.0024 and 0.0035. For different facial
images they are 0.0253 and 0.0275, roughly eight to ten times
higher than for the same person. Therefore, by taking a
threshold value of the variance we can decide whether facial
images belong to the same person.
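The variance figures for the same person can be checked directly from the R1/R2 and R1/R3 columns of Table II. The sketch below assumes the sample variance (division by n - 1) was used, which the text does not state; under that assumption the results come out close to the reported 0.0024 and 0.0035.

```python
from statistics import variance

# R1/R2 and R1/R3 columns of Table II (images of the same person).
r1_r2 = [1.4265, 1.4427, 1.4370, 1.4284, 1.4010, 1.4160,
         1.4542, 1.4154, 1.4364, 1.4624, 1.3361, 1.2953]
r1_r3 = [1.5854, 1.5494, 1.5484, 1.5276, 1.5403, 1.4823,
         1.5122, 1.4551, 1.4797, 1.4874, 1.6428, 1.6285]

var_r12 = variance(r1_r2)   # sample variance, divides by n - 1
var_r13 = variance(r1_r3)
print(round(var_r12, 4), round(var_r13, 4))
```

The same two lines applied to the Table III columns give the variances for different persons, which can then be compared with a chosen threshold.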



Figure 2. Several facial images of database-1 with background

Figure 3. Removal of background using Viola-Jones
algorithm on database-1

Figure 4. Ghost images of PCA for the six largest eigenvalues of
database-1

Figure 5. Several facial images of database-2 with
background

Figure 6. Removal of background using Viola-Jones
algorithm on database-2

Figure 7. Ghost images of PCA for the six largest eigenvalues
of database-2

Figure 8. Several facial images of database-3 with
background

Figure 9. Removal of background using Viola-Jones
algorithm on database-3

Figure 10. Ghost images of PCA for the six largest eigenvalues
of database-3

In the next part of the results section we use the PCA algorithm
to recognize the facial image of a person. In this section we
analyze two databases (a few images of the databases are
shown in Figure 2, Figure 5 and Figure 8 as examples).
First, we choose 5 test images from the same database (with
complex background) and evaluate the average error and
SNR for the largest eigenvalue, since the ghost image for the
highest eigenvalue gives the best impression, as shown in
Figure 4, Figure 7 and Figure 10. The results are shown in
the "Same" entries of Table IV. A similar analysis is done taking 5
test images from different databases, shown in the "Different"
entries of Table IV. The mean error is much smaller and the SNR in
dB is much larger for the "Same" entries than for the "Different"
entries, which validates the analysis. The maximum mean square
error is found to be 0.329x10^3 and the minimum SNR is 15.74 dB for
the "Same" entries, whereas the corresponding values are 2.84x10^3
and -28.516 dB for the "Different" entries. Therefore, SNR is found
to be the more sensitive parameter for identifying a face. If,
instead of the mean error and the corresponding SNR, we use the
minimum error and maximum SNR as the deciding parameters, we get
better results, as can be seen from Table V.


TABLE IV.

Average Error and SNR correspond to highest eigen value
(Complex background)

Database 1

Test image   Quantity     Same          Different
1            Mean Error   0.313x10^3    1.553x10^3
             SNR dB       26.391        -16.454
2            Mean Error   0.329x10^3    1.872x10^3
             SNR dB       20.743        -25.893
3            Mean Error   0.322x10^3    2.184x10^3
             SNR dB       21.609        -28.516
4            Mean Error   0.309x10^3    1.847x10^3
             SNR dB       21.850        -24.217
5            Mean Error   0.308x10^3    1.751x10^3
             SNR dB       15.474        -23.548


TABLE V.

Minimum error and maximum SNR correspond to highest eigen value
Test image from different database (complex background)

Database 1

Test image   Quantity     Same       Different
1            Mean Error   9.195      1.240x10^3
             SNR dB       92.301     -11.948
2            Mean Error   5.310      1.558x10^3
             SNR dB       103.99     -22.226
3            Mean Error   2.778      1.534x10^3
             SNR dB       116.70     -20.495
4            Mean Error   10.066     1.870
             SNR dB       90.320     -25.416
5            Mean Error   122.366    1.558x10^3
             SNR dB       42.422     -22.225


TABLE VI.

Average Error and SNR
Test image from same/different database

Database 2

Test image   Quantity     Same          Different
1            Mean Error   0.282x10^3    6.332x10^3
             SNR dB       47.325        -91.012
2            Mean Error   0.276x10^3    8.851x10^3
             SNR dB       54.804        -36.990
3            Mean Error   0.289x10^3    6.092x10^3
             SNR dB       48.783        -54.726
4            Mean Error   0.274x10^3    5.454x10^3
             SNR dB       47.527        -62.935
5            Mean Error   0.480x10^3    4.329x10^3
             SNR dB       37.138        -40.551

TABLE VII.

Minimum error and maximum SNR
Test image from same/different database

Database 2

Test image   Quantity     Same       Different
1            Mean Error   181.263    6.231x10^3
             SNR dB       80.333     -8.799
2            Mean Error   379.714    7.830x10^3
             SNR dB       65.145     -66.309
3            Mean Error   478.056    5.991x10^3
             SNR dB       60.339     -6.858
4            Mean Error   407.590    5.044x10^3
             SNR dB       64.811     -56.154
5            Mean Error   331.528    3.534x10^3
             SNR dB       68.795     -40.551


Taking threshold values of the mean error and SNR, we can now
decide whether a facial image is of the same person or not.
We worked on 30 databases and found 4 errors with logistic
regression (86.67% accuracy), 7 errors with the second algorithm
based on Euclidean distance (76.67% accuracy), and 3 errors with the
modified PCA (90% accuracy). Combining the three algorithms
(a linear combination of their results) gives only 2 errors (93.33%
accuracy).

IV. Conclusion 

In this paper we presented an interactive face recognition
algorithm that matches a test face against a database of
face images of a person, based on a modified PCA algorithm. We
considered the ghost image corresponding to the largest eigenvalue
for matching, based on the square error and SNR. The
entire work can also be implemented using the logistic regression
method, where each scattered point on the (x, y) plane corresponds
to the image number and the image matrix. A point in the
middle of the least-squares curve is then the ghost image, and
the above operation can be applied to identify an image. The
modified PCA and logistic regression methods can be compared
in terms of algorithm complexity and processing time. The
PCA or regression method can also be applied to the identification
of fingerprints or other biometrics. After combining
the three algorithms (logistic regression, Euclidean distance,
and PCA), the accuracy is 93.3%, which is better than that of
previous face recognition systems. Our future work will focus on
improving the efficiency of the algorithms using DWT, the LDA
method, fuzzy systems, and the ANFIS model.

Abstract - REST web services are a technology commonly applied
in the distributed enterprise application model. The high
number of requests and the complexity of the data received
simultaneously by REST web services are determining factors of
REST performance. The larger the data sent, the longer the
response time of the REST web service, as an effect of the
processing that takes place in the data source.
A database is one of the data sources generally used in
distributed enterprise applications based on REST web
services. However, a database that must process data
according to arrival sequence still has
limitations. Technically, the queries needed to implement this
mechanism become more complex. In addition, the resources
needed grow with the increasing number of requests
and amount of data. RabbitMQ is a data source and a lightweight
message broker that adopts the FIFO (First In First Out) concept in
processing data. This research implements
and evaluates RabbitMQ on REST web services. In addition,
REST web services that use a single
database and RabbitMQ as data storage are compared. The
output is an engineered flow for the data
received by REST web services, placing RabbitMQ between the
REST web services and the database. This engineering is based on the
performance evaluation of each data source.

Keywords: REST web services, Distributed Application, 
RabbitMQ, Message Broker 

I. Preface 

A high number of requests is an inseparable part of distributed
enterprise applications based on REST web services. Requests
accepted simultaneously from various locations influence the
performance and response time of REST web services,
because the data contained in each request has a
different size to be stored or processed. Larger
data sizes and a higher number of requests
add time to data storing and processing, so the
resulting response time also becomes high. Currently,
the system commonly used for data processing is a database.
However, a database has some limitations in this application.


Database performance may decrease when a high number of
requests is accepted for query processing. The
decrease is caused by high resource load,
and in certain cases the database has no
notification mechanism to distinguish old data from new data.
A database needs a mechanism to determine the data sequence
according to arrival so that the data it receives
follows the FIFO (First In First Out) principle. This mechanism is
often hand-coded in the application by adding a new column
that differentiates old data, new data, and data
being processed. This increases the number of
queries written and run to process the data and to satisfy the
notification mechanism. When the REST web service
forwards more than one data item to the database at the same time,
the database runs a query for each request sent and
marks that column. A higher number of requests therefore forces
the database to repeat queries to process the new data and
to update it for the notification mechanism and data marking.
Moreover, the query processing time depends
on the size of the data received. If the query for one data item
takes a long time, the database workload
escalates significantly and affects other data
received at the same time. This generally happens
with large data transfers and a high number of requests. In
addition, applying the FIFO concept in a database
increases the complexity of the queries that must be
implemented. A high database workload also affects system
scalability on the resource or hardware side: the resources
required grow along with the
increasing number of requests. These limitations make the
database a poor fit as a queue
system with the FIFO concept in a distributed system
based on REST web services. A message broker is used to
transfer messages between the source and the target server.
With a message broker, data can be transferred from one
location to another faster and more precisely [11],
[13]. A message broker ensures that a message is stored
successfully without the transaction locking
needed by a database when executing queries. This results in
relatively lower resource consumption than the database, while
giving better coverage. A database is good for storing more
structured data, but a message broker is better than a
database at processing a high number of simultaneous requests.
Therefore, the goal of this research is to implement a message
broker on REST web services, using
RabbitMQ. This research evaluates REST web
services that use a single database and a message broker as data
storage. In addition, this research engineers the flow of
data received by REST web services by placing
RabbitMQ between the REST web service and the database, based on
the evaluation results of each data source.

https://sites.google.com/site/ijcsis/
ISSN 1947-5500

International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 2, February 2018

II. Literature studies 

There has been much research on web service
performance using various technologies, architectures,
methods, and scopes. Two kinds of web
services are used: REST and SOAP (Simple
Object Access Protocol). Commonly, the discussion,
implementation, and development of web services are
influenced by the variables used. Those variables can
be categorized into data size, data type, and the response
time obtained. Data size is varied by creating
functions on the web service that receive data as a text parameter
of a certain size. The
data type is also varied: text, byte, or numeric
[3]. Of these three data types, numeric has a smaller
response time than text or byte for both REST and
SOAP web services. The smaller the response time, the
better the performance of the web service. On the
other hand, the architectures of REST and
SOAP differ in how they handle
requests and responses. Commonly, both kinds of web
service have four main components: HTTP Listener,
Request Handler, Parse Module, and Web Servlet [4]. Web
service performance can also be influenced by the method
used. Methods that can improve web
service performance include
compression, partial requests, and caching [5]. REST and SOAP
web services have also been used to upload
image data from mobile devices, and that data
served as a reference for the performance evaluation of the two
types of web service. In addition, each kind of web service has
been developed and implemented in
several variations by modifying the API (Application Programming
Interface), protocol, or cloud system used [6]-[8]. From the
research comparing the performance of
REST and SOAP, it can be concluded in general that REST
web services perform better than SOAP web
services [9]. Nevertheless, SOAP web services can be
used for more specific conditions, such as a client that
needs an object formed on the server side in real time and
that focuses on the security of that object [10].

The better performance of REST web services has been
shown from several sides: technology, framework,
data size, and the methods used. Therefore, REST web
services are currently used more than SOAP in the implementation of
distributed web applications, in
consideration of their better performance.
This is the basis for this research taking
REST web services as its research object, examining whether they
keep providing better performance when supported by other systems.
Those systems are the storage media, evaluated with variables
classified into
text data size, response time, and data integrity. Other
research related to message brokers describes
their use in data processing in distributed
systems, including cloud technology [2], [13]. RabbitMQ is a
message-broker technology that can be used to
evaluate high availability and fault tolerance in
middleware clustering [1]. The main difference between this research
and previous research is the request flow
engineering, which is based on the evaluation of the performance of
REST web services that use SQLServer and RabbitMQ.

A. RabbitMQ 

RabbitMQ is an open-source queue system and message broker
that is widely used for processing large amounts of data
[14]. It simplifies data
distribution in the communication among different systems and
uses AMQP (Advanced Message Queuing Protocol) as the
protocol for communication between producer and consumer
[15]. A producer in RabbitMQ is a node that sends a request in
the form of a message containing the data to be sent. The
data received is forwarded to the consumer
that is bound to RabbitMQ [16], [17]. The consumer is simpler
because it only receives RabbitMQ messages, identifying them by the
payload and label of each message. Figure 1 shows
the flow that takes place in RabbitMQ.
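The producer, queue and consumer roles described above can be illustrated with a small in-memory sketch. It deliberately does not use RabbitMQ or AMQP: the standard-library `queue.Queue` stands in for the broker queue, and the message fields `label` and `payload` are illustrative names only.

```python
import queue
import threading

broker = queue.Queue()   # in-memory FIFO standing in for a broker queue
received = []

def producer(n_messages: int) -> None:
    for i in range(n_messages):
        # Each message carries a label and a payload.
        broker.put({"label": "request", "payload": f"data-{i}"})
    broker.put(None)     # sentinel: no more messages

def consumer() -> None:
    while True:
        message = broker.get()
        if message is None:
            break
        received.append(message["payload"])

worker = threading.Thread(target=consumer)
worker.start()
producer(5)
worker.join()
print(received)          # payloads arrive in the order they were published
```

The consumer never asks the producer for ordering information; the FIFO property of the queue alone preserves the arrival sequence.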

Fig 1. Workflow of RabbitMQ


III. Methodology and System Design 

This research divides the system into 3 layers:
client, REST web services, and data source. The first layer is
the client, which accesses and uses the REST web
services. On this layer, every request sent to the REST web
services is answered with the
processed result. The client does not
know in detail how the message is processed and stored. The
second layer is the implementation of the REST web services;
this layer manages how a data request is received,
processed, and forwarded to the next layer. It is
built using Microsoft MVC .NET v4 technology. The third
layer is the data source, which uses Microsoft SQLServer
and RabbitMQ. This layer receives data
requests from the second layer and processes them for storage.
This research implements data flow engineering
between the REST web services layer and the data source
layer. The engineering is based on the performance of
each data source accessed by the REST web services. It is
expected that the result of this engineering brings good
performance to the REST web services layer.

In addition, this research uses a PC server with the
following specification: (1) Intel Xeon CPU
2.60GHz processor and 32GB memory, (2) Microsoft Windows Server
2012 R2 operating system, (3) Microsoft Internet Messaging
Service v8.5.9, (4) Microsoft ASP MVC .NET 4 application,
(5) Microsoft SQL Server 2012 R2 v12.0.4887.0, and (6)
RabbitMQ v3.6.6; and a client PC with the following
specification: (1) Intel Core i5 2.3GHz processor and
12GB memory, (2) Microsoft Windows 8 operating system.
Furthermore, this research uses a VPN (Virtual Private Network)
for communication between the client and the server.


IV. Implementation and Discussion 


Implementation and testing in this research start by
building REST web services and are divided into two stages:
first, using SQLServer and RabbitMQ in turn as the data
source of the REST web service; second, implementing the request
flow engineering based on the result obtained
from the first stage. In the first stage, 200 data requests are
sent in parallel, and the scheme used follows the 3-layer
concept. In this concept, the REST web services are located on
the second layer, which receives each request and forwards
it to SQLServer on the third layer. Figure 2 shows the
implementation of the first stage system.


Fig 2. Implementation of the first stage system (1st layer: client;
2nd layer: REST; 3rd layer: SQLServer)


A request sent by the first layer is received by the REST
web services and then forwarded to SQLServer.
The response from SQLServer is returned to the REST web
service and then to the client. This request-to-response
flow shows that the response the client receives depends
heavily on the second and third layers. When a response delay
occurs on one of those layers, it affects the client side as well.

The test data used in the request has a structure that
consists of 3 properties: Sessionid, Companyid, and
Data. Sessionid is used to
differentiate requests by session,
Companyid determines the entity used, and the Data
property holds the data content itself.
Figure 3 shows an example of the data string sent to the REST web
services.


i 

■Sessionid':"97007627530910751679035", 

•Q*praf: t AC31\ 

•Data":'AC30;200;;003021;20170727;0006;01;B;04;01;01;;;;R;6875;;;HARD-VG||||;31|24||||||; 

I||||||;2|€|10|12|||I;;;;RX;2;0.00;;;1;A730;;03;07;;50;;♦ 
AC30;300;;003021;20170727;0006;01;19;04;01;l>l;;;;L;4 5;;;HARD-VG| 111;31|24| HIM; 

1111111;2161101121111;;;;RX;2;0,00;;;1;A730;;03;07;;50;;I* 

|l 

Fig 3. Data text or string that is sent to REST web services. 

The data type of the request is text (string), formatted as JSON
(JavaScript Object Notation), in which the properties are
separated by commas (,). In addition,
a semicolon (;) separates the sub-data into the parts
that are mapped to columns in the table. The hash
sign (#) is used in the Data property to mark a row to be
stored in the table. The more complex the content of the Data
property, the larger the data text. The data sizes
sent by the client to the REST web services are
100Kb, 250Kb and 500Kb, and the failed and
successful data requests are counted to see how well the
REST web service performs if it is used in an enterprise application.
Table I and Figure 4 show the numbers of data requests that
succeeded and failed to be received by the REST web services
using SQLServer.
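The request format described above (a JSON object whose Data property uses ';' between columns and '#' at the end of each row) can be parsed as in the sketch below. The request body and its field values are hypothetical and heavily shortened; only the three property names and the two separators come from the text.

```python
import json

# Hypothetical, shortened request body; field values are illustrative only.
body = json.dumps({
    "Sessionid": "S-001",
    "Companyid": "C-01",
    "Data": "A;200;;x#A;300;;y#",
})

def parse_request(raw: str):
    request = json.loads(raw)
    # '#' terminates each table row; ';' separates the columns of a row.
    rows = [line.split(";") for line in request["Data"].split("#") if line]
    return request["Sessionid"], request["Companyid"], rows

session_id, company_id, rows = parse_request(body)
```

Empty strings between consecutive semicolons are kept, so each row retains its column positions.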


TABLE I. Result of REST web services with SQLServer

Message size   Failed   Succeed   Failed (%)   Succeed (%)
100Kb          110      90        55           45
250Kb          156      44        78           22
500Kb          195      5         97.5         2.5



Fig 4. Result of REST web service with SQLServer 


The result of the first stage shows that the number of requests that
succeed is smaller than the number that fail.
The failure rate grows with the data size. The REST web service
suffers a significant performance decrease at each data size when
200 data requests are sent in parallel within a certain
period. Therefore, the first stage model is better
suited to applications in which the average number of
requests sent is under 200 and the data size
is under 100Kb.

To solve the problem of the first stage, a new
layer is added between the second and the third layer. This
layer holds the message broker, which
receives all requests sent by the client.
RabbitMQ is chosen as the message broker because of its
better performance in handling larger data and a
high number of requests. In addition, the use of RabbitMQ in the
second stage is based on the better result
shown by RabbitMQ under the same test conditions
as in the first stage. The result of the RabbitMQ test can
be seen in Table II and Figure 5.


TABLE II. Result of REST web services with RABBITMQ

Message size   Result of REST web services
               Failed   Succeed   Failed (%)   Succeed (%)
100Kb          3        197       1.5          98.5
250Kb          7        193       3.5          96.5
500Kb          10       190       5            95



Fig 5. Result of REST web service with RabbitMQ 


The data types and sizes are the same as in the first stage, that
is 100Kb, 250Kb and 500Kb for each of the REST web services, with
groups of 10, 100, 1000, 1500 and 2000 data requests tested in
sequence. The average response time is computed for each group
on each REST web service. The test with 10 data requests shows
that the REST web service using RabbitMQ has a smaller average
response time than the REST web service using SQLServer. The
average response times for both data sources can be seen in
Table III and Figure 7.

TABLE III. Result of REST web services response time (10 REQUESTS)

Average response time (ms) for various message size (100,250,500)Kb

Client      SQLServer   RabbitMQ
Client 1    2547        1252.67
Client 2    2274        1299.33
Client 3    2000.67     1416.33
Client 4    2156.67     1075.33
Client 5    2132        1126.67
Client 6    2259.67     1026
Client 7    2088.33     1109
Client 8    2026.33     1075.33
Client 9    2317.67     1291.67
Client 10   2110.67     2214


This result shows that the REST web service using RabbitMQ
receives far more requests successfully than it fails to
receive, and that its results are more stable. The number of
successful requests stays high even as the data size grows.
Therefore, data integrity on the REST web service that uses
RabbitMQ is better than on the REST web service that uses
SQLServer.

Aside from data integrity, response time is also a factor used
to rate the performance of REST web services [3], [4], [6]. This
test is conducted by building REST web services on two different
data sources that are not connected to each other. The temporary
scheme of the REST web services implementation used for this
test can be seen in Figure 6.


[Chart: average response time of REST web services (10 requests); series: REST web services - SQLServer, REST web services - RabbitMQ]


Fig 7. Result of REST web services response time (10 requests) 


In addition, the tests on the groups of 100, 1000, 1500 and 2000
data requests also show good results for RabbitMQ. In these
tests the REST web service using RabbitMQ still has a smaller
response time than the REST web service using SQLServer.


[Diagram labels: 1st layer, 2nd layer, 3rd layer]



Fig 6. Implementation Scheme of REST web service with 
SQLServer and RabbitMQ. 


TABLE IV. Result of REST web services response time (10 requests)

Average response time (ms) for various message size (100,250,500)Kb

Client group   SQLServer   RabbitMQ
Group 100      5552.94     1816.26
Group 1000     6468.12     1860.36
Group 1500     7452.63     1921.34
Group 2000     10413.5     1963.29


[Chart: average response times of REST web services (100, 1000, 1500, 2000 requests); y-axis from 0 to 12000; x-axis: Group 100, Group 1000, Group 1500, Group 2000]

Fig 8. Result of REST web services response time 
(100,1000,1500,2000 requests) 
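
The gap between the two data sources in Table IV can also be expressed as a ratio. The following sketch is only an illustration: the helper name `speedup` is an assumption, and the numbers are copied from Table IV.

```python
# Average response times in ms copied from Table IV:
# group size -> (SQLServer, RabbitMQ).
TABLE_IV = {
    100: (5552.94, 1816.26),
    1000: (6468.12, 1860.36),
    1500: (7452.63, 1921.34),
    2000: (10413.5, 1963.29),
}


def speedup(times):
    """Return {group: SQLServer time divided by RabbitMQ time}."""
    return {group: sql / rabbit for group, (sql, rabbit) in times.items()}


if __name__ == "__main__":
    for group, ratio in speedup(TABLE_IV).items():
        print(f"Group {group}: SQLServer takes {ratio:.2f}x as long as RabbitMQ")
```

With these figures the ratio grows from roughly 3 for Group 100 to roughly 5 for Group 2000, so the advantage of RabbitMQ widens as the number of requests increases.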

Overall, the REST web service that uses RabbitMQ performs better
in these tests than the REST web service that uses SQLServer,
in both data integrity and response time. Therefore, the data
request flow is modified in the second stage. This flow
engineering builds on the previous test results by placing the
RabbitMQ system between the REST web service layer and
SQLServer. Figure 9 illustrates the flow engineering in the
second-stage implementation.



[Diagram labels: 1st layer, 2nd layer, 3rd layer, REST, RabbitMQ, FILES, SQLServer]


Fig 9. Layer implementation scheme on the second 
stage. 


When a client sends a request to the REST web services, the data
request is passed to the RabbitMQ system to be stored first. The
data is processed and stored in the data structure of RabbitMQ.
In addition, the RabbitMQ system backs the data up as a flat
file. This file is locked by the RabbitMQ system, so other
systems cannot open it directly and can reach it only through
the protocol and mechanism of RabbitMQ itself. The backup
process on RabbitMQ is a preventive step against data loss
caused by other factors, one of which is a system crash. The
stored data is then forwarded to SQLServer. The response
generated by SQLServer is returned to RabbitMQ and then passed
to the REST web services, which respond to the client. In the
second-stage scheme, RabbitMQ therefore acts as a bridge that
connects the REST web services with SQLServer. The performance
of the RabbitMQ system has a positive impact on the REST web
services without giving up the structured data processing that
SQLServer provides. The ability to receive a high number of
requests, together with high data integrity and good validity,
improves the performance of the REST web services in the
second-stage implementation. Ideally, the REST web services can
receive between 1500 and 2000 requests at the same time, with
the request data queued on the RabbitMQ system so that the
response returned to the client side improves. Furthermore, to
provide higher performance, clustering concepts and techniques
can be applied to RabbitMQ. Clustering may improve the
performance of the RabbitMQ system with fewer faults [1].
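
The second-stage flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the standard-library `queue.Queue` stands in for RabbitMQ, an in-memory `sqlite3` database stands in for SQLServer, and the names `BrokerStandIn`, `handle_requests` and `demo` are hypothetical. The response path is simplified so that the handler returns the database result directly.

```python
import json
import queue
import sqlite3
import tempfile
from pathlib import Path


class BrokerStandIn:
    """In-memory queue plus a flat backup file; a stand-in for RabbitMQ."""

    def __init__(self, backup_path):
        self._queue = queue.Queue()
        self._backup_path = Path(backup_path)

    def publish(self, message):
        # Append the message to the flat backup file, then queue it.
        with self._backup_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(message) + "\n")
        self._queue.put(message)

    def drain(self):
        # Yield queued messages in arrival order until the queue is empty.
        while not self._queue.empty():
            yield self._queue.get()


def handle_requests(requests, broker, conn):
    """Queue every request first, then forward the queued data to the database."""
    for request in requests:
        broker.publish(request)
    responses = []
    for message in broker.drain():
        cursor = conn.execute(
            "INSERT INTO messages (body) VALUES (?)", (message["body"],)
        )
        responses.append({"id": cursor.lastrowid, "status": "stored"})
    conn.commit()
    return responses


def demo(n=5):
    """Run n requests through the stand-in broker and database."""
    conn = sqlite3.connect(":memory:")  # stand-in for SQLServer
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "backup.log"
        broker = BrokerStandIn(backup)
        requests = [{"body": f"payload-{i}"} for i in range(n)]
        responses = handle_requests(requests, broker, conn)
        backed_up = len(backup.read_text(encoding="utf-8").splitlines())
    stored = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    conn.close()
    return responses, backed_up, stored
```

The point of the design is that the client-facing layer only has to hand the request to the broker; the slower structured write to the database happens after the request has been safely queued and backed up.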


V. Conclusion and Recommendation 

This research concludes that using RabbitMQ with REST web
services can have a positive impact on the performance of the
REST web service itself. The RabbitMQ system is placed as a
bridge that connects the REST web services with SQLServer,
because RabbitMQ performs better than SQLServer when receiving
a high number of requests. RabbitMQ produces smaller response
times than SQLServer for data text sizes of 100Kb, 250Kb and
500Kb. In addition, the RabbitMQ system shows better data
integrity than SQLServer under a high number of requests. The
small response time and high data integrity serve as a reference
for the data flow engineering applied to the third layer of the
early stage: a new layer, in the form of the RabbitMQ system,
is inserted between the REST web services layer and the
SQLServer layer. With this engineering, RabbitMQ bridges the
data flow from the REST web service to SQLServer while
maintaining the good performance of the REST web services under
a high number of requests and large data sizes. On the other
hand, this engineering also has a drawback for the second-stage
implementation: processor usage on the server becomes high.
Processor usage grows with the amount of data handled by
RabbitMQ, which calls for the use of clustering on RabbitMQ.
These two impacts, and the use of message-broker technologies
other than RabbitMQ, can serve as topics for further research.

Abstract - The main goal of Intrusion Detection Systems (IDSs) is
to detect intrusions. This kind of detection system is a
significant tool for ensuring cyber security in traditional
computer-based systems. An IDS model can be faster and reach more
accurate detection rates by selecting the most relevant features
from the input dataset. Feature selection is an important stage
of any IDS: it selects the optimal subset of features, which
makes training the model faster and reduces complexity while
preserving or enhancing the performance of the system. In this
paper, we propose a method based on dividing the input dataset
into different subsets according to each attack. We then perform
feature selection using an information gain filter on each
subset. The optimal feature set is generated by combining the
feature sets obtained for each attack. Experimental results on
the NSL-KDD dataset show that the proposed feature selection
method, with fewer features, improves the system accuracy while
decreasing the complexity. Moreover, a comparative study
examines the efficiency of the feature selection technique with
different classification methods. To enhance the overall
performance, a further stage applies a voting learning algorithm
to Random Forest and PART. The results indicate that the best
accuracy is achieved when using the product probability rule.

Keywords-Intrusion Detection Systems, NSL-KDD, Feature 
Selection, Supervised Learning, Classification. 

I. Introduction 

Wireless sensor networks (WSNs) consist of tiny sensor nodes or
devices that have a radio, a processor, memory, a battery and
sensor hardware. The widespread deployment of these sensor nodes
makes environmental monitoring possible. These small devices are
resource-constrained in terms of processor speed, radio range,
memory and power. These constraints lead designers to build
application-specific systems. Because WSNs are not protected and
the transmission medium is wireless, they are more vulnerable to
attacks. WSNs are also gradually being adopted in highly
sensitive applications, for instance detection of forest fires
[1], power transmission and distribution [2], localization [3],
military applications [4], critical infrastructures (CIs) [5]
and underwater wireless sensor networks (Underwater WSNs) [6].

A lack of proper security measures can allow different types of
attacks to be launched in hostile environments. Such attacks
can stop WSNs from working normally and defeat the purpose of
the deployment. Consequently, security is a significant feature
of these networks. The scarcity of resources forces designers to
use traditional security primitives, such as encryption and
one-way functions, cautiously. Intrusion detection is seen as a
second line of defense that complements the security primitives.
To be practical in WSNs, intrusion detection schemes need to be
lightweight, scalable and distributed. This paper proposes such
approaches for anomaly intrusion detection in WSNs. In this
context, it is very important to protect the sensor network from
cyber-security threats. Regrettably, achieving this objective is
a challenge because of a number of WSN features, the most
important being inadequate computational resources, which
inhibit the execution of robust cryptographic mechanisms, and
deployment in wild and unattended environments, where an
adversary can access the sensor nodes physically, for instance
by reading cryptographic keys straight from memory. The rapid
development of technology over the Internet makes computer
security a serious issue. Currently, artificial intelligence,
data mining and machine learning algorithms are under broad
investigation in intrusion detection (ID), with emphasis on
enhancing detection accuracy and building resilient IDS models.
In addition to detection abilities, IDSs also offer extra
mechanisms, for instance diagnosis and prevention.
Wireless sensor network IDS architectures are presently being
examined, and various solutions have been recommended in the
research.

This paper concentrates on building an IDS for WSNs. To construct
an IDS model faster and with more accurate detection rates,
choosing the vital features from the input dataset is extremely
important. Feature selection during the learning process, while
the model is being designed, decreases the computational cost
and improves precision. The main objective of this paper is to
determine the most suitable features for identifying attacks in
the NSL-KDD dataset; the WEKA tool [7] is used for the analysis.
Different performance metrics are used to assess the performance
of each classifier, such as precision, recall, F-measure, false
positive rate, overall accuracy (ACC) and the ROC curve.

The NSL-KDD dataset [8] is a common dataset for anomaly
detection, particularly for identifying intrusions. It
comprises forty-one features that describe different types of
network traffic. The network traffic is divided into two
classes: the normal class and the anomaly class. The anomaly
class represents intrusions or attacks that originated from the
network while the traffic was being recorded. The attacks in the
NSL-KDD dataset are further categorized into four main classes:
DoS, probing, user to root (U2R) and remote to local (R2L). A DoS
attack makes crucial services unavailable to genuine users by
bombarding computing and network resources with attack packets.
Examples of DoS attacks include back, land, smurf, teardrop and
neptune. Because of the high risk and computational cost
associated with DoS attacks, the 2014 study [9] dealt primarily
with DoS attacks. A DoS attack is a major concern for legitimate
users who access services through the Internet, because it
makes those services unavailable by exhausting network and
system resources. Although network security professionals have
carried out a great deal of research to defeat DoS attacks,
these attacks are still on the rise and their detrimental
impact grows over time.

The paper is organized as follows. Section 2 presents an
overview of intrusion detection and reviews related work.
Section 3 describes the proposed IDS model, and Section 4
analyzes the experimental results obtained. Finally, Section 5
states the conclusions.

II. Literature Review 

An intrusion detection system uses machine learning algorithms
or classifiers to learn normal or abnormal system behavior and
build models that help to classify new traffic. The goal of
developing optimal machine-learning-based detection systems
directs research towards examining the performance of a single
algorithm or multiple algorithms on all four major attack
categories rather than on a single attack category. Some of the
algorithms and methods used by researchers in this field are
mentioned below, with a focus on studies that used NSL-KDD to
analyze their experimental results.

Hota and Shrivas, 2014 [10], proposed a model that used
different feature selection techniques to remove the irrelevant
features in the dataset and developed a classifier that is more
robust and effective. The feature selection methods combined
with the classifier are Info Gain, Correlation, Relief and
Symmetrical Uncertainty. Their experimental work was divided
into two parts. The first built a multiclass classifier based on
various decision tree techniques such as ID3, CART, REP Tree and
C4.5. The second applied feature selection to the best model
obtained, which was C4.5. Their experimental analysis was
conducted using the WEKA tool. The results showed that C4.5 with
Info Gain performed best and achieved the highest accuracy of
99.68% with only 17 features. With 11 features, however,
Symmetrical Uncertainty achieved 99.64% accuracy.

Deshmukh, 2014 [11], developed an IDS using a Naive Bayes
classifier with different pre-processing methods. The authors
used the NSL-KDD dataset and WEKA for their experimental
analysis. They compared their results with other classification
algorithms such as NB Tree and AD Tree. The results showed
that, when the TP rates of all algorithms are considered, Naive
Bayes has the shorter execution time.

Noureldien Yousif, 2016 [12], examined the performance of
seven supervised machine learning algorithms in detecting DoS
attacks using the NSL-KDD dataset. The experiments used the
Train+ 20 percent file for training and the Test-21 file for
testing. They used 10-fold cross validation to test and evaluate
the methods and to confirm that the techniques would perform
well on unseen data. Their results showed that Random Committee
was the best algorithm for detecting the smurf attack, with an
accuracy of 98.6161%. On average, the PART algorithm was the
best for detecting DoS attacks, whereas the Input Mapped
algorithm was the worst.

Jabbar and Samreen, 2016 [13], presented a novel approach for ID
that uses alternating decision trees (ADT) to classify the
various types of attacks, although ADT is usually used for
binary classification problems. The results showed that their
proposed model produced a higher detection rate and reduced the
false alarm rate in the classification of IDS attacks.

Paulauskas and Auskalnis, 2017 [14], analyzed the influence of
initial data pre-processing on attack detection accuracy using
an ensemble model of four different classifiers: J48, C5.0,
Naive Bayes and PART. Ensembles rely on the idea of combining
multiple weaker learners to create a stronger learner. Min-Max
normalization and Z-Score standardization were applied in the
pre-processing stage. They compared their proposed model with
and without pre-processing techniques using more than one
classifier. Their results showed that the proposed classifier
ensemble model produces more accurate results. After presenting
their results, they warned against using only the
NSL-KDDTrain+ dataset for both training and testing because,
even without pre-processing, it leads to 99% accuracy.
Therefore, the NSL-KDDTest+ dataset must be used for model
assessment. In this way the real performance of the model in
detecting new types of attack can be tested.

Wang, 2017 [15], suggested an SVM-based intrusion detection
technique that pre-processes the data by replacing the original
attributes with the logarithms of the marginal density ratios,
which exploits the classification information included in each
feature. This results in concise, high-quality data, which in
turn achieved better detection performance and reduced the
training time required for the SVM detection model.

Yin et al., 2017 [16], explored how to model an IDS based on a
deep learning approach using recurrent neural networks
(RNN-IDS), because of their potential to extract better
representations of the data and create better models. They
pre-processed the dataset using a numericalization technique
because the input of RNN-IDS should be a numeric matrix. The
results showed that RNN-IDS achieves a high accuracy and
detection rate with a low false positive rate compared with
traditional classification methods.

Feature selection, as a vital part of any IDS, can make training
the model less complex and faster while preserving or even
enhancing the overall performance of the system. Shahbaz et al.
[17] suggested an efficient feature selection algorithm that
considers the correlation between the behavior class label and a
subset of attributes to address dimensionality reduction and to
identify good features. The outcomes revealed that the proposed
model has a considerably lower training time while preserving
accuracy and precision. Additionally, several feature selection
methods were tested with different classifiers with regard to
the detection rate. The comparison reveals that the J48
classifier performs well with the proposed feature selection
method.

Similarly, the study in [18] proposed a new intelligent IDS that
works on a reduced number of features. First, the authors rank
the features on the basis of information gain and correlation.
Feature reduction is then done by combining the ranks obtained
from both information gain and correlation, using a novel
approach to identify useful and useless features. The reduced
features are then fed to a feed-forward neural network for
training and testing on the KDD99 dataset. The method uses
pre-processing to eliminate redundant and irrelevant data from
the dataset in order to improve resource utilization and reduce
time complexity. The feature-reduced system performs better
than the system without feature reduction. To address feature
selection for detecting rare attack categories, the researchers
in [19] used cascaded SVM classifiers to classify the non-rare
attack categories and BN classifiers to classify the rare
attack categories, combined with a cascaded GFR feature
selection method (CGFR). The experimental results showed that
CGFR feature selection is effective and accurate in IDS.

Redundant and irrelevant features in data have been a persistent
problem in network traffic classification. To address this
concern, Ambusaidi et al. [20] proposed a supervised
filter-based feature selection algorithm that analytically
selects the optimal features for classification. The proposed
Flexible Mutual Information Feature Selection (FMIFS) reduces
the redundancy among features. FMIFS is then combined with the
Least Square Support Vector Machine based IDS (LSSVM-IDS)
technique to develop an IDS. The model is evaluated on three
intrusion detection datasets, namely KDD Cup 99, NSL-KDD and
Kyoto 2006+. The evaluation results revealed that the feature
selection algorithm contributes more essential features,
allowing LSSVM-IDS to achieve better accuracy and lower
computational cost than state-of-the-art techniques.

Ikram and Cherukuri, 2017 [21], proposed an ID model using
Chi-Square attribute selection and a multi-class support vector
machine (SVM). The main idea behind this model is to construct a
multi-class SVM, which had not so far been adopted for IDS, to
decrease the training and testing time and increase the
classification accuracy for individual network attacks.

In [22], Khammassi and Krichen applied a wrapper method based on
a genetic algorithm as the search strategy and logistic
regression as the learning algorithm to choose the best subset
of features for network IDSs. The proposed approach consists of
three stages: a pre-processing stage, a feature selection stage
and a classification stage. The experiments were conducted on
the KDD99 dataset and the UNSW-NB15 dataset. For the KDD99
dataset, the results showed a classification accuracy of
99.90%, a FAR of 0.105% and a DR of 99.81% with a subset of only
18 features. Furthermore, the selected subset provides a good DR
of 99.98% for the DoS category. For UNSW-NB15, the results gave
the lowest FAR, 6.39%, and good classification accuracy
compared with the other mentioned approaches, with a subset of
20 features.

Motivated by this work, we try to find out which of the
classification algorithms we select gives better results after
selecting the features that are strongly correlated in the
training dataset. In this work, we conduct experiments to
distinguish normal from abnormal behavior.

III. Proposed IDS Methodology 

The main goal of this research is to build an intrusion
detection framework with a minimum number of features in the
dataset. Previous research showed that only a subset of these
features is related to ID. The aim is therefore to reduce the
dimensionality of the dataset in order to build a better
classifier in a reasonable time. The proposed approach consists
of four main phases. The first phase selects the related
features for each attack using a feature selection method. The
second combines the different features to obtain the optimal set
of features for all attacks. The third feeds the final set of
features to the classification stage. Finally, the model is
tested using a test dataset. The framework of the proposed
methodology is shown in Fig. 1.

A. Selecting the Related Features for Each Attack 

Because a network intrusion detection system deals with a large
amount of raw data, feature selection has become a basic step
in building such a system. Feature selection covers a number of
methods and techniques used to eliminate irrelevant and
redundant features. The dimensionality of the dataset has a
large effect on model complexity, and high dimensionality leads
to low classification accuracy and high computational cost and
time. These methods also aim to select the optimal features that
will enhance the performance of the model. There are two general
categories of feature selection methods: filter methods and
wrapper methods [23]. Filter algorithms use an independent
measure (such as information, distance or consistency) to
estimate the relevance of a set of features, while wrapper
algorithms use a learning algorithm to evaluate the value of the
features. In this study, Information Gain (IG) is used to
select the subset of related features. IG is often cheaper and
faster than wrapper methods.

Figure 1. Framework of The Proposed Model of IDS

Information gain is computed for each individual attribute in
the training dataset with respect to one class. A high ranked
value means that the feature is highly distinctive for this
class. If the value is below the predetermined threshold, the
feature is removed from the feature space. To obtain a good
threshold value, the distribution of the IG values is examined
and tested with different threshold values on the training
dataset.

The IG of a feature t over all classes is given by equation (1).

$$
IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
      + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
      + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}) \tag{1}
$$

Where:

• $c_i$ represents the i-th category.

• $P(c_i)$: probability that a random instance belongs to
class $c_i$.

• $P(t)$ and $P(\bar{t})$: probabilities that the feature t
occurs or does not occur in a randomly selected instance.

• $P(c_i \mid t)$: probability that a randomly selected instance
belongs to class $c_i$ if the instance has the feature t;
$P(c_i \mid \bar{t})$ is the same probability for instances
without the feature t.

• m is the number of classes.
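
Equation (1) can be checked on small examples with the sketch below. It is a simplified illustration, not the WEKA implementation used in the experiments: it assumes a binary feature (present or absent) and a base-2 logarithm, which equation (1) leaves unspecified, and the function names are hypothetical.

```python
import math
from collections import Counter


def sum_p_log_p(probabilities):
    """Return sum(p * log2(p)), skipping zero probabilities."""
    return sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(present, labels):
    """Information gain of a binary feature, following equation (1).

    present: iterable of truthy/falsy flags, one per instance (feature t).
    labels:  class label of each instance (the classes c_i).
    """
    present = list(present)
    labels = list(labels)
    classes = sorted(set(labels))

    def class_probs(subset):
        # Class distribution within a subset of instances.
        counts = Counter(subset)
        return [counts[c] / len(subset) for c in classes] if subset else []

    with_t = [y for x, y in zip(present, labels) if x]
    without_t = [y for x, y in zip(present, labels) if not x]
    p_t = len(with_t) / len(labels)
    p_not_t = len(without_t) / len(labels)

    return (
        -sum_p_log_p(class_probs(labels))
        + p_t * sum_p_log_p(class_probs(with_t))
        + p_not_t * sum_p_log_p(class_probs(without_t))
    )
```

A feature that separates the two classes perfectly yields the full class entropy as its gain, while a feature that is independent of the class yields a gain of zero.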


The feature selection stage for each attack is divided into
three main steps as follows:

Step 1: The training dataset is divided into 22 datasets, one
per attack. Each dataset file contains the records of one attack
merged with the normal records. If the whole dataset were used
without splitting, the feature selection method would be biased
towards the most frequent attacks, so this step is essential
to obtain more accurate results.

Step 2: Each file is then used as input to the IG method to
select the most relevant features for that attack. For example,
the spy attack has the related features ranked as shown in
Table I.

Step 3: A ranked feature list is generated and, according to
a threshold, a number of features are eliminated. From the
list in Table I, the most relevant features for the spy attack
are features 38 and 39 if the threshold is set to 0.003. So we
can keep these two best features and eliminate the others.
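
Step 1 above can be sketched as follows. This is an illustration under assumptions: records are represented as dictionaries, and the field name `label` and the value `normal` are hypothetical placeholders for however the class column is stored.

```python
from collections import defaultdict


def split_by_attack(records, label_key="label", normal_label="normal"):
    """Return {attack name: that attack's records merged with all normal records}."""
    normal = [r for r in records if r[label_key] == normal_label]
    by_attack = defaultdict(list)
    for record in records:
        if record[label_key] != normal_label:
            by_attack[record[label_key]].append(record)
    # Every per-attack subset gets its own copy of the normal traffic.
    return {attack: rows + normal for attack, rows in by_attack.items()}


if __name__ == "__main__":
    sample = [
        {"duration": 0, "label": "normal"},
        {"duration": 1, "label": "smurf"},
        {"duration": 2, "label": "spy"},
        {"duration": 3, "label": "normal"},
    ]
    for attack, subset in split_by_attack(sample).items():
        print(attack, len(subset))
```

Because each subset contains only one attack against the normal traffic, a rare attack such as spy is ranked on its own terms rather than being swamped by frequent attacks.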


TABLE I. Spy Ranked Related Features

Ranked Value   Feature Number   Feature Name
0.004029       38               dst_host_serror_rate
0.0036057      39               dst_host_srv_serror_rate
0.0018171      3                service
0.0012618      18               num_shells
0.0011184      15               su_attempted
0.0008256      19               num_access_files
0.0001008      2                protocol_type
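
The thresholding in Step 3 can be reproduced directly on the spy ranking. In this sketch the values are copied from Table I, while the function name `select_by_threshold` is an assumption.

```python
# (IG value, feature number, feature name) rows copied from Table I.
SPY_RANKING = [
    (0.004029, 38, "dst_host_serror_rate"),
    (0.0036057, 39, "dst_host_srv_serror_rate"),
    (0.0018171, 3, "service"),
    (0.0012618, 18, "num_shells"),
    (0.0011184, 15, "su_attempted"),
    (0.0008256, 19, "num_access_files"),
    (0.0001008, 2, "protocol_type"),
]


def select_by_threshold(ranking, threshold):
    """Keep the feature numbers whose IG value reaches the threshold."""
    return [number for value, number, _name in ranking if value >= threshold]


if __name__ == "__main__":
    print(select_by_threshold(SPY_RANKING, 0.003))
```

With a threshold of 0.003 only features 38 and 39 remain, which matches the selection described in Step 3.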


B. Combining the Different Set of Features for All Attacks 

In this step, a combined list of features for all attacks is
generated from the obtained subsets. For some attacks, the three
highest-ranked features are selected. For other attacks, such
as the land attack, only one feature is taken, since its rank is
equal to 1 while the ranks of the other features are very low.
This means that this feature alone can fully discriminate this
attack.
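
The combination rule can be sketched as a union of per-attack selections. The code below is a hypothetical illustration: the function names, the cut-off of three features and the decisive value of 1.0 follow the description above, the spy values come from Table I, and the land values are made up for the example.

```python
def top_features(ranking, k=3, decisive=1.0):
    """Pick one feature if it fully discriminates the attack, else the top k.

    ranking: list of (IG value, feature number) pairs for one attack.
    """
    ranked = sorted(ranking, key=lambda pair: pair[0], reverse=True)
    if ranked and ranked[0][0] >= decisive:
        return [ranked[0][1]]
    return [number for _value, number in ranked[:k]]


def combine(per_attack):
    """Union of the selected feature numbers over all attacks, sorted."""
    combined = set()
    for ranking in per_attack.values():
        combined.update(top_features(ranking))
    return sorted(combined)


if __name__ == "__main__":
    rankings = {
        # Values from Table I (spy attack).
        "spy": [(0.004029, 38), (0.0036057, 39), (0.0018171, 3), (0.0012618, 18)],
        # Made-up values: one feature with rank 1, the rest very low.
        "land": [(1.0, 7), (0.0004, 2), (0.0001, 3)],
    }
    print(combine(rankings))
```

Taking the union means that a feature selected for several attacks appears only once in the final list.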

C. Classification of the Training Dataset 

The final combined subset is used as input to the classification
stage. The results of three different classifiers are considered
in the comparative study: J48, Random Forest (RF) and Partial
Decision List (PART). After conducting the experiments, the two
classifiers with the best results are chosen. The next step is
to use the vote ensemble method to enhance the performance of
the model.

• J48 classifier: C4.5 (J48) is an algorithm developed by Ross
Quinlan that is used to generate a decision tree. It has become
popular in classification and data mining. The algorithm uses
the gain ratio as the criterion for splitting the dataset; the
gain ratio normalizes the information gain by a “split
information” value.


• Random Forest: a machine learning method that combines
decision trees with ensemble methods. The input features used
to build the component trees are picked at random. The forest
generation process constructs a collection of trees with
controlled variance; majority voting or weighted voting can be
used to decide the resulting prediction.

• Partial Decision List (PART): PART is a decision-list
algorithm based on partial decision trees, joining the
advantages of both C4.5 and RIPPER. A pruned decision tree is
built for the current instances, a rule is created from the
leaf node with the largest coverage, the tree is then
discarded, and the process continues on the remaining
instances.

• Ensemble classifier: An ensemble classifier combines
multiple weak machine learning algorithms (known as weak
learners) to improve classification performance. The
combination of weak learners can be based on different
strategies such as majority vote, boosting or bagging.

Figure 2. Distribution of Attacks in NSL-KDDTrain+

Figure 3. Distribution of Attacks in NSL-KDDTest+
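
The vote ensemble can combine the class probability estimates of its members in several ways; the abstract reports the best accuracy with the product probability rule. The sketch below shows that rule on two classifiers. It is a stand-alone illustration rather than the WEKA implementation, and the function name and the probability values are made up for the example.

```python
def product_rule(distributions):
    """Combine class probability distributions by multiplying them per class.

    distributions: list of {class name: probability} dicts, one per classifier.
    Returns (predicted class, normalised combined distribution).
    """
    classes = list(distributions[0])
    scores = {c: 1.0 for c in classes}
    for dist in distributions:
        for c in classes:
            scores[c] *= dist[c]
    total = sum(scores.values())
    if total > 0:
        scores = {c: s / total for c, s in scores.items()}
    predicted = max(scores, key=scores.get)
    return predicted, scores


if __name__ == "__main__":
    random_forest = {"normal": 0.6, "attack": 0.4}  # made-up estimate
    part = {"normal": 0.2, "attack": 0.8}           # made-up estimate
    print(product_rule([random_forest, part]))
```

Under the product rule a confident member can overrule a hesitant one: here the first classifier leans towards normal, yet the combined prediction is attack.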

D. Testing the Model 

In this stage, the KDD-Test test dataset is used to evaluate the
model generated by the vote ensemble method. The test dataset
file is different from the training dataset and contains
additional attacks. The performance of the model is then
evaluated using measures such as accuracy and area under the
ROC curve.

IV. Results and Analysis 

This section discusses the experimental results. All
experiments were conducted on a Windows platform with an
Intel® Core™ i7 CPU at 2.70 GHz and 8 GB of RAM. The WEKA tool
was used to evaluate the method and perform feature selection.
To select the optimal training parameters, 10-fold cross
validation (CV) was performed on the training dataset.
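
As an illustration of how k-fold cross validation partitions a dataset, the following minimal sketch splits sample indices into contiguous folds. It is an assumption-laden simplification: the function name is hypothetical, and no shuffling or class stratification is performed, whereas a real evaluation tool may do both.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 25 hypothetical records split into 10 folds.
folds = list(k_fold_indices(25, k=10))
print(len(folds), [len(test) for _, test in folds])
```

Each of the k models is trained on k-1 folds and evaluated on the remaining one, and the reported CV accuracy is aggregated over all folds.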

A. Dataset Description 

All experiments are carried out on the NSL-KDD datasets [8].
NSL-KDD is a refined version of the KDD'99 dataset that
overcomes some inherent problems of the original. Redundant
records in the training set have been removed so that the
classifiers produce unbiased results, and there are no
duplicate records in the improved testing set. The biased
influence on the performance of the learners is therefore
significantly reduced. Each connection in this dataset contains
41 features. The experiments in this work use the KDDTrain and
KDDTest data. The different attacks are listed in Table II. The
distributions of attacks in the NSL-KDDTrain+ and NSL-KDDTest+
files are shown in Fig 2 and Fig 3.

TABLE II. Attacks in NSL KDD Training Dataset

Attack Type | Attack Name
------------|--------------------------------------------------
DOS         | Neptune, Smurf, Pod, Teardrop, Land, Back
Probe       | Port-sweep, IP-sweep, Nmap, Satan
R2L         | Guess-password, Ftp-write, Imap, Phf, Multihop, Spy, Warezclient, Warezmaster
U2R         | Buffer-overflow, Load-module, Perl, Rootkit

B. Evaluation Metrics 


The performance of the proposed model is evaluated using
several metrics: precision (equation 2), recall (equation 3),
F-measure (equation 4), true negative rate, false positive rate
and overall accuracy (ACC) (equation 5), also known as the rate
of correctly classified instances (CC). In addition, the
Receiver Operating Characteristic (ROC) of the system is
presented. The ROC curve plots the true positive rate on the
y-axis against the false positive rate on the x-axis.

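$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Recall} + \mathrm{Precision}} \qquad (4)$$

$$\mathrm{Accuracy} = \frac{\text{Number of Correct Classified Connections}}{\text{Number of Connections}} \times 100\% \qquad (5)$$
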

Where:

• TP: the number of true positives.

• FP: the number of false positives.

• FN: the number of false negatives.
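
The metrics in equations (2) to (5) can be computed directly from confusion counts, as in the sketch below. The function name and the example counts are hypothetical; TN (the number of true negatives) is not defined in the list above and is introduced here only so that accuracy can be written for the two-class case.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, F-measure, accuracy in percent)."""
    precision = tp / (tp + fp)                                  # equation (2)
    recall = tp / (tp + fn)                                     # equation (3)
    f_measure = 2 * precision * recall / (recall + precision)   # equation (4)
    # Equation (5): correctly classified connections over all connections.
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy


# Hypothetical confusion counts, not taken from the experiments.
precision, recall, f_measure, accuracy = classification_metrics(
    tp=80, fp=20, fn=10, tn=90)
print(precision, recall, f_measure, accuracy)

# Equation (5) applied to the correct/incorrect counts reported for PART
# with 28 features on NSL-KDD Test+.
part_test_accuracy = 100.0 * 16287 / (16287 + 2507)
print(round(part_test_accuracy, 4))
```

The last two lines apply equation (5) to counts reported in the results tables, which gives a quick consistency check between the "Correctly Classified" and "Accuracy" columns.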

C. Results Analysis 

After many experiments on the combined list, the optimal
number of combined features was found to be 28. These features,
together with their numbers in the dataset, are listed in
Table III.


TABLE III. The Final Selected Features

Feature Number | Feature Name
---------------|---------------------------
1              | duration
2              | protocol type
3              | services
4              | flag
5              | src bytes
6              | dst bytes
7              | land
8              | wrong fragment
9              | urgent
10             | hot
11             | num failed logins
13             | num compromised
14             | root shell
17             | num file creations
18             | num shells
19             | num access files
26             | srv serror rate
29             | same srv rate
30             | diff srv rate
31             | srv diff host rate
32             | dst host count
33             | dst host srv count
34             | dst host same srv rate
36             | dst host same src port ra
37             | dst host srv diff host rat
38             | dst host serror rate
39             | dst host srv serror rate
41             | dst host srv rerror rate


Table IV compares the accuracy and other evaluation metrics
obtained with two reduced sets of attributes (19 and 28
features) against the full dataset with 41 attributes, using
the PART classifier under two test options: cross validation
and NSL-KDD Test+. The highest cross-validation accuracy
(99.7984%) is obtained with the set of 19 features, while the
highest accuracy on the NSL-KDD Test+ dataset (86.66%) is
obtained with 28 features.

Table V presents a comparison of three classification
algorithms with the proposed method under both CV and testing.

For comparison, we used several popular classification
algorithms: J48, Random Forest (RF) and Partial Decision List
(PART). The highest testing accuracy (86.66%) is achieved by
the PART algorithm, whereas the highest CV accuracy (99.85%) is
obtained using RF. Fig. 4 shows a comparison of the
classification algorithms in terms of accuracy under the cross
validation and NSL-KDD Test+ options. According to these
results, the best two classifiers (PART and RF) were chosen as
inputs to the voting ensemble algorithm. Table VI shows the
performance of the voting learning algorithm with Random Forest
and PART, used to improve the accuracy of the intrusion
detection system. When the Random Forest and PART classifiers
are combined under different combination methods, the accuracy
of the model is enhanced. Table VI also shows that the CV
accuracy is nearly the same for the three rules, but on the
supplied test dataset the three rules behave differently. The
best test accuracy is achieved with the product probability
rule. Finally, the area under the ROC curve, shown in Fig. 5,
is calculated for each attack class in the dataset based on
cross validation and NSL-KDD Test+. The results also show that
the ROC values for DoS and Probe attacks are almost the same
under the two test options, but the values fluctuate for R2L
and U2R attacks.
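
The three combination rules compared in Table VI can be sketched as follows. This is a simplified illustration, not the WEKA Vote implementation: the function name, the three-class layout and the probability values are assumptions, and ties are broken in favour of the lowest class index.

```python
def combine(distributions, rule):
    """Combine per-classifier class-probability lists into one class index."""
    n_classes = len(distributions[0])
    if rule == "average":
        scores = [sum(d[c] for d in distributions) / len(distributions)
                  for c in range(n_classes)]
    elif rule == "product":
        scores = [1.0] * n_classes
        for d in distributions:
            for c in range(n_classes):
                scores[c] *= d[c]
    elif rule == "majority":
        # Each classifier casts one vote for its most probable class.
        scores = [0] * n_classes
        for d in distributions:
            scores[max(range(n_classes), key=d.__getitem__)] += 1
    else:
        raise ValueError(f"unknown rule: {rule}")
    # max() returns the first maximal index, so ties go to the lowest class.
    return max(range(n_classes), key=scores.__getitem__)


# Hypothetical class probabilities from two classifiers for one connection.
rf = [0.55, 0.45, 0.0]
part = [0.10, 0.90, 0.0]
for rule in ("majority", "average", "product"):
    print(rule, combine([rf, part], rule))
```

With only two classifiers, majority voting ties whenever they disagree, while the probability-based rules can still use how confident each classifier is. This is one plausible reason the rules may behave differently on the supplied test set.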

Figure 5. Final ROC Area for each Class for CV and NSL-KDDTest+


V. Conclusion and Future Work 

An IDS is used to secure computer-based systems against many
cyber-attacks. Feature selection at the initial stage of a
machine learning approach has been shown to enhance detection
performance. In this research, we proposed a feature selection
approach using information gain, calculated for each attack in
the NSL-KDD dataset, to identify the optimal feature set for
each attack and select these features according to thresholds.
The feature lists for all attacks are then combined. The
experimental results show that the highest accuracy is obtained
when the Random Forest and PART classifiers are combined using
the product probability rule.

As future work, it is suggested to use the adaptive boosting
learning algorithm in the feature selection stage instead of
IG, which may increase the efficiency of the detection system.


Figure 4. Accuracy Results of Three Classifiers (J48, Random-Forest, PART) Using Cross Validation and NSL-KDDTest+ (28 attributes; series: Testing-ACC, Cross Validation)


TABLE IV. Results with Different Number of Features Using PART

Feature set | Test Option      | Correctly Classified | Incorrectly Classified | Accuracy  | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
------------|------------------|----------------------|------------------------|-----------|---------|---------|-----------|--------|-----------|---------
19          | Cross Validation | 125719               | 254                    | 99.7984 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 0.999
19          | NSL-KDD Test+    | 16231                | 2563                   | 86.3627 % | 0.864   | 0.124   | 0.794     | 0.864  | 0.814     | 0.856
28          | Cross Validation | 125701               | 272                    | 99.7841 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 0.999
28          | NSL-KDD Test+    | 16287                | 2507                   | 86.6606 % | 0.867   | 0.108   | 0.850     | 0.867  | 0.823     | 0.880
41          | Cross Validation | 125714               | 259                    | 99.7944 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 0.999
41          | NSL-KDD Test+    | 16283                | 2511                   | 86.6394 % | 0.866   | 0.124   | 0.881     | 0.866  | 0.818     | 0.857

TABLE V. Cross-Validation and Test Results of Three Classifiers

Classifier Name | Test Option      | Correctly Classified | Incorrectly Classified | Accuracy  | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
----------------|------------------|----------------------|------------------------|-----------|---------|---------|-----------|--------|-----------|---------
J48             | Cross Validation | 125644               | 329                    | 99.7388 % | 0.997   | 0.002   | 0.997     | 0.997  | 0.997     | 0.999
J48             | NSL-KDD Test+    | 16178                | 2616                   | 86.0807 % | 0.861   | 0.119   | 0.774     | 0.861  | 0.814     | 0.840
Random-Forest   | Cross Validation | 125785               | 188                    | 99.8508 % | 0.999   | 0.001   | 0.998     | 0.999  | 0.998     | 1.000
Random-Forest   | NSL-KDD Test+    | 16259                | 2535                   | 86.5117 % | 0.865   | 0.112   | 0.831     | 0.865  | 0.819     | 0.943
PART            | Cross Validation | 125701               | 272                    | 99.7841 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 0.999
PART            | NSL-KDD Test+    | 16287                | 2507                   | 86.6606 % | 0.867   | 0.108   | 0.850     | 0.867  | 0.823     | 0.880

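TABLE VI. Cross-Validation and Test Results Using Vote Method With (RF+PART)

Combination Rule    | Test Option      | Correctly Classified | Incorrectly Classified | Accuracy  | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
--------------------|------------------|----------------------|------------------------|-----------|---------|---------|-----------|--------|-----------|---------
Majority Voting     | Cross Validation | 125743               | 230                    | 99.8174 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 0.999
Majority Voting     | NSL-KDD Test+    | 16292                | 2502                   | 86.6872 % | 0.867   | 0.108   | 0.850     | 0.867  | 0.823     | 0.847
Product probability | Cross Validation | 125737               | 225                    | 99.8127 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 0.999
Product probability | NSL-KDD Test+    | 16294                | 2496                   | 86.6979 % | 0.867   | 0.108   | 0.851     | 0.867  | 0.823     | 0.884
Average probability | Cross Validation | 125743               | 230                    | 99.8174 % | 0.998   | 0.001   | 0.998     | 0.998  | 0.998     | 1.000
Average probability | NSL-KDD Test+    | 16292                | 2502                   | 86.6872 % | 0.867   | 0.108   | 0.850     | 0.867  | 0.823     | 0.947
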
A Case Study of Soft Computing Model for Wheat
Production Management

Abstract: Computer-based dissemination of agricultural
information, expert systems and decision support systems
(DSS) play a pivotal role in sustainable agricultural
development. The adoption of these technologies requires
knowledge engineering in agriculture. Diversification in
application, spatio-temporal variation, and uncertainty in
environmental data pose a challenge for knowledge
engineering in agriculture. Wheat production management
decisions in Pakistan require the acquisition of
spatio-temporal information, capturing the inherent
uncertainty of climatic data, and processing information to
find possible solutions to problems. In this paper a
framework for engineering a knowledge base and a soft
computing model for production management of the wheat crop
is presented. The framework includes an ontology-based
knowledge representation scheme along with a structured
rule-based system for query processing. A soft computing
model for acquisition and processing of wheat production
information for decision support is presented, along with
knowledge delivery through the semantic web.

Key Words: Ontology, Knowledge Engineering, 
Agriculture, Semantic Web, Rule Based System 

I. INTRODUCTION 

Knowledge Engineering (KE) [Darai, 2010] is the
aspect of system engineering which addresses the solution
of problems in uncertain processes by emphasizing the
acquisition of knowledge and its representation in a
knowledge-based system. KE is defined by Edward
Feigenbaum and Pamela McCorduck (1983) as an
engineering discipline that involves integrating
knowledge into computer systems in order to solve
complex problems normally requiring a high level of
human expertise. Knowledge engineering is a field
within artificial intelligence that develops
knowledge-based systems. Such systems are computer
programs that contain large amounts of knowledge, rules
and reasoning mechanisms to provide solutions to
real-world problems.

Artificial Intelligence (AI) is the area of
computer science which focuses on developing
machines and computer systems that exhibit intelligence
like human beings. Using AI techniques and methods,
researchers are creating systems which can mimic
human expertise in any field of science. Applications of
AI range from creating robots to soft computing
models (softbots) that can reason like a human expert and
suggest solutions to real-life problems. AI can be used
for reasoning on the basis of incomplete and uncertain
information and for delivering predictive knowledge.

Jawed Naseem 2
SSITU, SARC
Pakistan Agriculture Research Council
Karachi, Pakistan

Sustainable agricultural development requires
adaptation and incorporation of newly developed
technology to enhance agricultural production [Khan,
2010]. The technology may involve development of
new varieties, agricultural production management,
water management or crop protection. Adaptation of
these technologies involves continuous updating of
knowledge regarding a particular technology and
processing of information by experts to deliver
appropriate solutions to problems in agriculture
[Hoogenboom 1999]. High-level human expertise in
agriculture, as in other disciplines of science, is not
only scarce but also costly. Besides this, expert
knowledge in agriculture requires mass dissemination of
information to a large audience of end-users including
policy makers, researchers, extension workers and
ultimately farmers. Conventional means of
communicating agricultural information have limited
scope for knowledge acquisition, processing and instant
delivery to the end-user [Khan, 2010]. Computer-based
systems and soft computing models can provide an
effective and efficient knowledge management system
[Kolhe 2011]. Knowledge engineering in the domain of
agriculture comprises three basic components:
knowledge acquisition and representation, information
processing, and delivery of possible solutions.
Ontology is an effective way of representing knowledge
in a domain of application [Burney and Nadeem 2012].

In this paper a case study of knowledge
engineering for a wheat production management decision
support system is discussed. Section II discusses
challenges and issues in wheat production management.
Section III describes the mechanism for knowledge
representation of wheat production management
information. Section IV discusses the mapping of wheat
production technology information into a knowledge
base, Section V presents the soft computing model and
Section VI discusses future work.

II DECISION MAKING IN WHEAT 
PRODUCTION MANAGEMENT 

Wheat production management requires decision
making on several factors, from pre-cultivation to
harvesting, based on different parameters (Table-1).
Wheat production technology exhibits spatio-temporal
variation. In Pakistan, wheat varieties have been
developed that are suitable for different agro-ecological
zones and have varied sets of decision parameters. The
main parameters are selection of an appropriate variety,
agronomic practices (planting date, seed rate, etc.), and
irrigation management during cultivation. The yield of
the crop is affected not only by these factors but also by
the management of diseases and pests, which depends on
environmental conditions and contributes to plant growth.
Capturing spatio-temporal variation in wheat production
technology is a challenge in the development of a wheat
production knowledge base. The technological information
varies with the agricultural zone (spatial) and with the
time of adoption of the technology (temporal). Pest and
disease monitoring is another essential component of
wheat plant growth. Diagnosis of a disease, along with
the intensity of attack, depends on certain factors
including the environment. Incomplete information and
dynamic changes in data contribute to uncertainty.
Capturing this inherent uncertainty is therefore another
challenge in the design of KE for a wheat production
knowledge base. However, AI methods and techniques
can address these issues, and probabilistic reasoning is
one of the options. Wheat production technology in
Pakistan, based on the factors indicated in Table-1, has
been developed and is available through many sources.
Agricultural technology is constantly changing as a
result of continuous research. Sharing and updating
newly generated knowledge is therefore also an essential
component of agricultural information management.

Decision making in wheat production
management extends from pre-cultivation to the
harvesting of the crop (Fig-1). In pre-cultivation, a
wheat variety suitable for the particular agricultural
zone must be selected, taking into account the prevailing
cropping system.

Cultivation management is more dynamic than
pre-cultivation and post-harvest management. The
critical decisions involve the appropriate planting date,
seed rate at the time of sowing, fertilizer application
and irrigation management. During cultivation
management, environmental parameters (temperature,
humidity) require continuous updating because they
affect the incidence of pests and diseases, which is a
make-or-break factor in crop yield.




# | Decision Domain | Environmental Considerations | Biological Considerations
--|-----------------|------------------------------|--------------------------
1 | Crop & Cultivar selection | Temperature, growing season, soil conditions | Crop adaptation, pest resistance
2 | Land Preparation, Crop timing and Methods | Soil temperature and moisture, soil temperature, humidity | Soil biology, germination, emergence, growth rate
3 | Irrigation Management | Rainfall amount and distribution, soil moisture | Water required/available by crop, water use efficiency
4 | Fertility Management | Soil chemical condition, soil moisture and aeration, soil temperature | Soil/plant nutrient uptake, growth rate, crop residue contribution to subsequent crop
5 | Pest management | Temperature, humidity, rainfall | Weed, insect and disease population, population of pest predators or parasites
6 | Harvest Timing and Methods | Temperature, rainfall, light intensity, humidity | Risk of loss due to over maturity or pest damage

Table-1 Parameters affecting Crop Production Decisions


Similarly, timely application of water and fertilizer,
along with monitoring of the growth stages of the wheat
crop, is required to achieve an optimum wheat yield.


Fig-1 Decision Requirement in Wheat Production Management (Pre Cultivation: Agro-Ecological Zone, Cropping System, Selection of Variety; Cultivation Management: Land Preparation, Growth Monitoring, Irrigation Management, Sowing Time, Fertilizer Application, Seed Rate, Pest and Disease Management; Post Harvest Management: Storage Method, Pest Monitoring)

III KNOWLEDGE REPRESENTATION IN WHEAT
PRODUCTION MANAGEMENT

Reasoning on the basis of available facts to obtain a
required solution requires that the facts are represented
in an appropriate form. Knowledge representation is the
field of artificial intelligence (AI) devoted to representing
information about a domain of interest, such as
agriculture, in a form that a computer system can utilize
to solve complex tasks such as diagnosing plant disease
or structuring rules to classify information. One
technique of knowledge representation is the use of an
ontology [Natalya F] [Burney and Nadeem 2012].
An ontology defines the terms and concepts commonly
used in a particular domain. Ontology development is
therefore a process of representing the terms, concepts
and relationships in a domain of interest. The main
advantage of ontology representation is that it
provides an explicit conceptual standard that can be
shared and commonly used to describe the information
in a domain of interest.

The use of ontology can be approached from different
points of view [Sofia]:

1. Building a new ontology from scratch.

2. Building an ontology by assembling, extending,
specializing and adapting other ontologies, which become
parts of the resulting ontology.

3. Reusing ontologies, by merging different ontologies
on the same or a similar subject into a single one that
unifies all of them.

In this study the second option is used, utilizing the
terms and concepts of AGROVOC along with the rice
production ontology [Thunkijjanukij 2009].

In agriculture, ontologies have been developed by
many sources. AGROVOC is a controlled vocabulary
in agriculture covering all areas of interest including
food, nutrition, agriculture, fisheries, forestry,
environment, etc. AGROVOC is a collaborative effort,
kept up to date by the AGROVOC team at FAO, by
a number of involved institutions serving as focal points
for specific languages, and by individual domain
experts. To date, AGROVOC contains over 32,000
concepts organized in a hierarchy; each concept may
have labels in up to 22 languages. Thunkijjanukij et al.
proposed a rice production ontology for production
management in Thailand. The ontology contains
concepts, terms and relationships related to rice
production.

AGROVOC arranges agricultural terms in a
hierarchy of application, which helps to develop a
conceptual model of entities and entity
relationships [Sachit 2012]. The traditional AGROVOC
Thesaurus [Sachit 2012] is made up of terms (Table-2)
connected by hierarchical and non-hierarchical
relations. The relations used are the classical relations
(Table-3) used in thesauri: BT (broader term), NT
(narrower term), RT (related term) and UF
(non-descriptor). Scope notes and definitions are used in
AGROVOC to clarify the meaning and the context of
terms. In addition to "terms", AGROVOC also uses the
notion of "concept" and a larger set of relations
between concepts. A concept is represented by all the
terms, preferred and non-preferred, in all languages, with
which it is associated. Both concepts and terms
participate in relationships with other concepts and
terms:


Term                    | Type
------------------------|---------------
Fertilizer application  | Descriptor
Fertilizer combinations | Descriptor
Fertilizer formulations | Descriptor
Fertilizers             | Descriptor
Planting date           | Descriptor
Planting density        | Non-Descriptor
Planting depth          | Descriptor
Planting distance       | Non-Descriptor
Planting methods        | Descriptor
Disease prevalence      | Non-Descriptor
Disease recognition     | Descriptor
Disease reporting       | Non-Descriptor
Disease resistance      | Descriptor
Disease surveillance    | Descriptor
Disease symptoms        | Non-Descriptor
Disease transmission    | Descriptor
Disease treatment       | Non-Descriptor
Diseases                | Descriptor

Table-2 Common Agricultural terms (snapshot)

The processes in wheat production are defined
by specifying relationships between concepts and
terms (Table-3). A relationship may be of the equivalence
or the associative type and has a forward and an inverse
direction. For instance, the object property
hasIrrigationMethod defines the relationship between an
agricultural zone and an irrigation process, e.g. Zone-IV
hasIrrigationMethod Rain-Fed. Similarly, the inverse
relationship specifies Rain-Fed isIrrigationMethodof
Zone-IV. In the wheat production ontology, AGROVOC
terms and concepts are utilized for representing
knowledge (Table-3).


Relationship          | Inverse Relationship   | Relationship Type
----------------------|------------------------|------------------
hasCommonName         | isCommonNameof         | Equivalence
hasCultivationProcess | isCultivationProcessof | Associative
hasCultivationMethod  | isCultivationMethodof  | Associative
hasIrrigationProcess  | isIrrigationProcessof  | Associative
hasIrrigationMethod   | isIrrigationMethodof   | Associative
hasPest               | IsPestof               | Associative
Produce               | IsProducedFrom         | Associative
isResistantTo         | IsHarmlessFor          | Associative
isSucceptibleTo       | IsHarmfuFor            | Associative

Table-3 Relationship in wheat production ontology
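
The forward and inverse relationships of Table-3 can be illustrated with a small sketch that derives the inverse statement for every stored fact. It uses plain Python tuples rather than an ontology editor. The function name is hypothetical, only two rows of Table-3 are included, and the single fact is the Zone-IV example from the text.

```python
# Forward property -> inverse property, following two rows of Table-3.
INVERSE = {
    "hasIrrigationMethod": "isIrrigationMethodof",
    "hasPest": "IsPestof",
}


def with_inverses(triples):
    """Return the facts plus the inverse statement of each known property."""
    result = set(triples)
    for subject, prop, obj in triples:
        if prop in INVERSE:
            result.add((obj, INVERSE[prop], subject))
    return result


facts = {("Zone-IV", "hasIrrigationMethod", "Rain-Fed")}
closed = with_inverses(facts)
for triple in sorted(closed):
    print(triple)
```

Declaring the inverse once, at the level of the property, means that only one direction has to be entered by hand. The other direction can always be derived.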
Utilizing these basic concepts and relationships, wheat
production technology is represented in machine-readable
form.

IV KNOWLEDGE BASE DEVELOPMENT 

A. Knowledge Base (KB) Wheat Production 

In the next step, a Knowledge Base (KB) is developed
using the wheat ontology. The KB comprises a
representation of facts, a mechanism for logical reasoning
and querying methods. Logical reasoning is done by
defining rules and imposing constraints. We utilized
Protege, an effective open source tool, to develop the KB
and define rules. Protege employs a graph database to
store information, SWRL (Semantic Web Rule Language) for
rule development, SPARQL (SPARQL Protocol and RDF Query
Language) for query development and RDF (Resource
Description Framework) to deliver wheat knowledge
through the semantic web. Graph databases and RDF are
basic components of the semantic web [Canda].

The graph database is an effective tool of the
semantic web. It is a kind of database that uses graph
structures to represent and store data through nodes,
edges and properties [Silvescu]. The graph database is
quite different from a relational or hierarchical database.
In a graph database resources are related to other
resources (Fig 2), with no single resource having any
particular intrinsic importance over another. Resources
are connected through properties. Graph databases, by
design, allow simple and fast retrieval of complex
hierarchical structures that are difficult to represent in
relational database systems.



Fig 2 Wheat graph database (snapshot of the graph database for wheat production technology)


Different methods can be used as the underlying
storage mechanism in a graph database, including
tables, documents or RDF graphs. In this research RDF is
utilized for storage.

The Resource Description Framework (RDF)
is a family of World Wide Web Consortium (W3C)
specifications originally designed as a metadata data
model. The graph data model is the model the semantic
web uses to store data, and RDF is the format in which it
is written. RDF describes information contained in a Web
resource and is commonly serialized in an XML-based
syntax (RDF/XML).

B. Querying Wheat KB 

Retrieval of information from the wheat KB can
be done in two ways: querying the RDF graph using
the SPARQL [Zheng] query language, or utilizing a
rule-based system using SWRL [Connor]. In this study both
approaches are used.
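
Querying an RDF graph amounts to matching triple patterns in which some positions are left open. The sketch below imitates this with plain Python and is not SPARQL itself. The function name is hypothetical, and the facts other than the Zone-IV example are invented placeholders.

```python
def match(triples, s=None, p=None, o=None):
    """Return, sorted, the triples that agree with every non-None position."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


FACTS = [
    ("Zone-IV", "hasIrrigationMethod", "Rain-Fed"),
    ("Zone-II", "hasIrrigationMethod", "Canal"),   # hypothetical fact
    ("Wheat", "hasPest", "Aphid"),                 # hypothetical fact
]

# Which zones use the Rain-Fed irrigation method?
# Roughly the pattern: ?zone hasIrrigationMethod Rain-Fed
zones = [s for s, _, _ in match(FACTS, p="hasIrrigationMethod", o="Rain-Fed")]
print(zones)
```

A SPARQL engine generalizes this idea by joining several such patterns through shared variables.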

C. Rules based system 

Rule-based expert systems and rules are a
structured way of reasoning and classifying
information. Production rules in the form of if-then
clauses are used to define or apply constraints in a
declarative manner. Domain rules are structured in
informal, semi-formal and formal ways. However, in a
knowledge base rules are expressed in a formal system.
Informal statements in natural language are transformed
into a formal language or rule execution language.

In AI several methods can be used for the formal
expression of rules, such as SQL (Structured Query
Language), ECA (event-condition-action), predicate logic
and propositional logic. In the wheat production expert
system, ontology-based predicate logic and axioms are
used to define rules for extraction and updating of
information from the wheat knowledge base. To develop an
ontology-based rule [Kalibatiene 2010], a three-step
process is carried out:

• Express the rule in informal natural language

• Express the rule using ontology concepts and
relationships

• Express the rule in formal predicate logic

In the wheat production management expert system,
conditions and actions are proposed by an agriculture
expert in natural language, for example:

Informal Rule: If soil fertility is average and
rainfall is moderate then 2 bags of urea are
applied.

Using this scheme, let us take one example.

Informal: if soil is loamy and soil condition is
weak then apply 3 bags of NPK or 2
bags of DAP at the time of sowing

Ontology based Representation:

<Soil> <has_type> <loamy> and
<Soil> <has_condition> <Weak>
then <Fertilizer> <has_name> = "NPK" <hasQuantity> = "2 bags", or
<Fertilizer> <has_name> = "DAP" <hasQuantity> = "3 bags",
and <Fertilizer> <hastimeofapplication> = "at sowing"

The ontological expression is transformed into a rule
using SWRL syntax. So the formal expression in predicate
logic will be

SoilType(?x) ∧ SoilCondition(?y) → Fertilizer(?z),
Fertilizer(?f) ∧ hastimeofapplication(?t) → has_quantity(?q)

The terms in brackets represent variables, which are
replaced by named individuals.
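
The if-then structure of such a production rule can be sketched in a few lines. This is an illustration only, not the SWRL rule used in the knowledge base: the function name and dictionary keys are hypothetical, and the quantities follow the informal statement of the rule above.

```python
def recommend_fertilizer(soil_type, soil_condition):
    """Return the alternative fertilizer recommendations for the given soil."""
    # Condition part of the rule: loamy soil in weak condition.
    if soil_type == "loamy" and soil_condition == "weak":
        # Action part: either alternative may be applied at sowing.
        return [
            {"name": "NPK", "quantity": "3 bags", "time": "at sowing"},
            {"name": "DAP", "quantity": "2 bags", "time": "at sowing"},
        ]
    # No rule fires for other combinations in this sketch.
    return []


options = recommend_fertilizer("loamy", "weak")
for option in options:
    print(option["name"], option["quantity"], option["time"])
```

In a rule engine the condition and action would be stored as data rather than hard-coded, so that a domain expert could add or change rules without touching program code.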

In summary, the knowledge base of wheat
production technology comprises a set of terms,
concepts and relationships and a set of rules which
can perform reasoning on wheat production technology
information in Pakistan and finally propose a solution
to the queried problem.


Fig 3 Semantic Web Wheat Production Technology (components: Web Based User Interface, Wheat Knowledge Base)

V SOFT COMPUTING MODEL WHEAT
PRODUCTION MANAGEMENT

Finally, we propose a soft computing model
(Fig 4) for knowledge engineering in agriculture. The
model employs acquisition of agricultural information
through conventional methods, development of a
knowledge base through an OWL ontology, and
implementation of a reasoning mechanism on top of it
using a rule-based system and soft computing techniques
for classification and probabilistic reasoning. The
technical implementation of the system is undertaken
through the semantic web (Fig 3). A web-based user
interface enables users to submit problems, which are
transformed into queries and submitted to the KB.
Knowledge is processed using embedded rules and the
solution is delivered using RDF.


Fig. 4 Soft Computing Model for DSS in Wheat 
Production Management 

II RESULT AND DISCUSSION 

Knowledge engineering in agriculture is
essential for developing knowledge bases, expert systems
and decision support systems. In this paper a framework
for knowledge engineering of wheat production
technology in Pakistan is discussed. The framework is
a four-tier process. In the first tier, wheat production
technology is acquired in an informal, simple language
(English) structure. In the second tier, agricultural
ontology is used for a more formal representation using
specific technical terms which define concepts as well
as relationships. In the third tier, a rule-based system is
used for logical reasoning and knowledge processing,
and in the fourth tier the acquired knowledge is delivered
using the semantic web through RDF. An essential
component of the framework is that knowledge can be
updated by a user who has limited knowledge of
information technology. Ontology provides basic building
blocks with an underlying logical structure, which
facilitates mapping highly technical agricultural
information into machine-readable form. The proposed
system is flexible, scalable and dynamic.

The developed system is dynamic in the sense
that it continuously updates domain knowledge using
expert opinion as well as new technology based on
research, and scalable in the sense that it can be
modified for the management technologies of other crops.
With regard to scope of application, ontology-based
knowledge representation of wheat production
management has several advantages, as it facilitates
generic application development of the knowledge base
for different users. Further, the use of predicate logic in
the rule-based system facilitates adoption of semantic
web technology, enabling implementation on diversified
platforms including embedded mobile phone based
technology. One limiting factor of the proposed
system is the need to incorporate some local
language terms specific to a particular region.
However, this can be achieved by updating or
redefining some ontological concepts or terms.
VI FUTURE WORK 

A rule-based system for reasoning over the
knowledge base is more effective if it is capable of
handling incomplete and uncertain information. The
authors plan to use a Bayesian network and a fuzzy
logic system for wheat disease diagnosis [Burney
2015] and for predicting the impact of pest and disease
attack on crop yield, in order to capture the inherent
uncertainty of these factors, as they have a profound
effect on overall production.
Abstract

VPN services not only help to connect different
entities and organizations; they also form a critical
component on which various interactive services that
provide internet coverage depend. As the number of
business locations and settings attached to a private
network grows, so do the interconnection requirements
and the complexity of the network. VPN usage is also a
decisive factor because network management has become
more essential and more expensive. Large private
networks often far exceed smaller ones in size and
complexity, which is why the virtual private network
must be studied carefully to show the benefits that
allow it to connect, retain and sustain diverse
business models. In this regard, the paper discusses
the interconnect functionalities of VPN and examines
various VPN operations along with the associated
network security concerns.

1. Introduction 

Data protection, together with
accessibility, is an essential component in
the execution of many business procedures. Obtaining adequate
access to networks from remote locations can be a
daunting prospect for many people. Such
individuals range from mobile salespersons, who
must promote products and connect with the community
over various networks, to company
executives, who must stay informed about
the procedures taking place in their
corporations [2]. These users hope to
reach their business resources on
the company network while in remote locations. For this
reason, they want a private network model that
will connect them to their company network, or an
arrangement that will connect the
server in their business data center to an external
internet connection [1]. Because of the
valuable information such a network holds,
security concerns should be considered with great
care.

Security incidents of this kind can be
prevented by using VPN models.
A VPN uses a
public network to connect remote sites and
users. It relies on
virtual links routed over the
Internet from a corporation's private
network to the remote sites [2-5].
With a VPN, the data exchanged over
the link is encrypted to prevent
eavesdropping and identity theft. Thus, even in the
worst case, where other parties intercept the data
traffic, the captured ciphertext is of little
use because it does not reveal the underlying
information. Internet censorship is among the most
serious threats to the online privacy of
citizens and to their freedom to obtain
information [6, 7]. Accordingly, a variety
of applications, for instance Tor, SSH tunnels, and
VPNs, provide an encrypted channel that
helps people evade censorship systems.

On the other hand, such tools are
effective only as long as the censoring
system can inspect data
packets solely as plaintext. With advances in
traffic analysis, it is possible
for an observer to extract sensitive
information from encrypted packets,
because encryption does not
change a packet's statistical properties, for example
its length, its timing, and its
direction [8]. Given that various kinds of
statistical information can be recovered from
encrypted exchanges, it is quite possible to
identify a traffic flow's path and other
information from encrypted traffic patterns.
2. Literature review

Virtual Private Networks are today
becoming the most widespread technique for
remote access, given that they allow a service
provider to exploit the reach of the Internet
by providing a private channel
through the public cloud, realizing cost savings and
efficiency gains over other remote
access mechanisms [9-11]. VPNs meet the
following key enterprise requirements: compatibility,
security, availability, and manageability.
Generally, a VPN is an extension of an enterprise's
private intranet across the Internet, creating a
secure private connection, essentially through a
private tunnel. VPNs securely transmit data
across the Internet and can be built in a
variety of ways. While some consist of
routers and firewalls connected to
service providers, others may consist of a
combination of application proxy firewalls,
intrusion detection, encryption,
tunneling, and key management [12]. Some
VPNs are managed internally, while
others are outsourced to an external service
provider. The VPN is a concept that has a
significant impact on the future of
business communications. It offers
a new approach to the
long-standing problem of delivering efficient,
reliable, and user-friendly telecommunications
for large and geographically dispersed groups of
subscribers [13].

A VPN often replaces
existing private networks with a flexible
design that is easily managed and
provides improved services. A
VPN is a network built
on public infrastructure, typically the Internet, with the
aim of connecting to a private network, for instance
a corporation's internal network [4].
A number of systems allow an
individual to create networks that use the
Internet as the medium for
transporting data. These systems
use encryption and other security
mechanisms to ensure that only authorized users
can access the network and
that the data cannot be intercepted
[7]. A VPN is therefore intended to provide a
secure, encrypted tunnel through which
data flows between the remote client
and the corporate network. Information
passing between the two locations through the
encrypted tunnel cannot be intercepted or read
by anyone else, because the system has
several components that protect both the
corporation's private network and the external
network through which the remote client connects.
Consumers use a personal VPN
service to protect their online activity and identity:
with an anonymous VPN
service, the Internet traffic and the user's
data remain encrypted, preventing
eavesdroppers from observing Internet
activity [14]. VPN services
are particularly helpful when accessing public
networks, for instance public Wi-Fi, since
public wireless services may not be
secure. Beyond public Wi-Fi security, a personal VPN
also offers customers unrestricted
Internet access and can help prevent
data theft and unblock
websites.

The typical business Local Area Network
(LAN) and the Wide Area Network
(WAN) are examples of private
networks [15]. The line between a private
network and a public one is drawn at the
gateway router, where a corporation will put
up a firewall to keep
intruders on the public network out
of the private network, and to
keep internal users from
reaching the public network.
Some time ago, when corporations could let their LANs
function as separate, isolated islands,
confidentiality became a concern, as there was a
possibility that other parties would intercept the
data in transit. Nowadays each entity typically has its
own LAN with its own naming
scheme, electronic mail system, and
preferred network protocol, none of which
needs to be compatible with the Virtual
Private Network [16]. Corporations and
businesses use a VPN to communicate
privately over a public network and
to send video, voice, or data. The
VPN is an excellent
option for remote employees and for businesses with
worldwide offices and partners to share
information privately. The most commonly
used type of VPN is the virtual private dial-up
network (VPDN) [13]. It is a user-to-LAN
connection, in which remote users need
to connect to the corporation's LAN. Another
widely used type of VPN is the site-to-site
VPN. In this form, the corporation invests in
hardware to link multiple sites to its LAN
over a public network, usually the
Internet.

A VPN is an extension of an enterprise's
private intranet across a public network, for instance
the Internet, creating a secure private connection,
essentially through a private tunnel. VPNs
securely transmit data across the Internet,
connecting remote users, branch offices, and
business partners into an extended corporate
network. The VPN is virtual, which means
that the physical infrastructure of the network should be
transparent to any VPN connection [3]. It also means
that the VPN user does not own the
physical network; it is a
public network shared with many other
users. To provide the necessary transparency to the
upper layers, protocol-tunneling techniques are used
[5]. To overcome the implications of not
owning the physical network, service level
agreements with network providers should be
established to provide, as far as
possible, the performance and availability
required by the VPN. The VPN
is private, which means that there is
privacy for the traffic that flows
over it. VPN traffic often flows
over public networks, and therefore
measures must be taken to provide the
security required for any
particular traffic profile that is to pass
over the VPN connection [6]. These security
requirements include data encryption,
data origin authentication, secure generation and
timely refresh of the
cryptographic keys needed for encryption
and authentication, and protection against replay of
packets and address spoofing. The VPN
is a network: although not physically
existent, it must effectively be, and be seen as, an
extension of a corporation's network infrastructure.
It should therefore be made available to the
rest of the network, to all or a
specified subset of its devices and applications, by
regular means of topology such as routing and
addressing.
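
Two of the security requirements listed above, data origin authentication and protection against packet replay, can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions: the packet layout (8-byte sequence number, payload, 32-byte HMAC-SHA-256 tag), the `seal` function, and the `Receiver` class are hypothetical and do not correspond to any particular VPN protocol. The payload is left unencrypted to keep the sketch short, and the receiver accepts only strictly increasing sequence numbers rather than a sliding window.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32  # SHA-256 digest size in bytes


def seal(key: bytes, seq: int, payload: bytes) -> bytes:
    """Prefix a 64-bit sequence number and append an HMAC tag over both."""
    header = struct.pack("!Q", seq)
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + payload + tag


class Receiver:
    def __init__(self, key: bytes):
        self.key = key
        self.last_seq = -1  # highest sequence number accepted so far

    def open(self, packet: bytes) -> bytes:
        header, payload, tag = packet[:8], packet[8:-TAG_LEN], packet[-TAG_LEN:]
        expected = hmac.new(self.key, header + payload, hashlib.sha256).digest()
        # Data origin authentication: only a holder of the key can produce the tag.
        if not hmac.compare_digest(tag, expected):
            raise ValueError("authentication failed")
        (seq,) = struct.unpack("!Q", header)
        # Replay protection: reject any sequence number already seen or older.
        if seq <= self.last_seq:
            raise ValueError("replayed packet")
        self.last_seq = seq
        return payload
```

A receiver built this way accepts a sealed packet once and raises `ValueError` if the same packet is presented again or if any byte has been altered.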

3. Findings 

As more corporate resources moved
onto computers, there arose the
need for these offices to
interconnect. This was
traditionally done using leased phone lines
of varying speeds; in this way, a corporation
could be assured that the connection was
always available and private [9].
A virtual private network simulates
such a private network over a public one, for
instance the Internet.

4. OpenSSH

OpenSSH is a network security tool
based on the SSH protocol. It is
used to protect communications
carried over a network by encrypting
the network traffic, offering
several authentication methods
and secure tunneling capabilities.
OpenSSH can
forward remote TCP ports through a secure
channel, which allows TCP ports on the
server side and on the client side to be connected through
the SSH channel [11]. This feature can
multiplex additional TCP connections over
a single SSH connection, thereby concealing
connections and encrypting protocols that
would otherwise be insecure. It also
allows traffic to bypass firewalls, which may
raise security concerns within the
network as a whole. There are also third-party
programs that support
tunneling over SSH, including
CVS, rsync, DistCC, and Fetchmail [13].
On the other hand, OpenSSH had a notable
vulnerability: if a network used
the default configuration, an attacker had
an opportunity to recover plaintext.
The release of OpenSSH 5.2
changed the behavior of earlier versions that
had given attackers this
opportunity, which further
reduced the vulnerability of OpenSSH.
Despite the
vulnerabilities observed in
earlier versions, it
is possible to build an effective multi-layer VPN
with OpenSSH [14].
SSH has also introduced features
that make it easy to set up VPNs
built on the existing SSH
authentication methods.
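
The TCP port forwarding described above corresponds to the OpenSSH client's `-L` option. As a small, hedged sketch, the following Python helper only builds the command line; the user name, host names, and port numbers are placeholders chosen for illustration.

```python
def local_forward_command(user, host, local_port, target_host, target_port):
    """Build an argv list for `ssh -N -L local_port:target_host:target_port user@host`."""
    return [
        "ssh",
        "-N",  # do not run a remote command; only forward ports
        "-L", f"{local_port}:{target_host}:{target_port}",
        f"{user}@{host}",
    ]


# Placeholder values: forward local port 8080 to an intranet web server via a gateway.
cmd = local_forward_command("alice", "gateway.example.com", 8080,
                            "intranet.example.com", 80)
```

On a machine with an OpenSSH client installed, the resulting list could be passed to `subprocess.run(cmd)`; connections to local port 8080 would then travel through the encrypted SSH channel.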

In the past, users had
to log in with personal accounts, unlike the
situation today. OpenSSH has features that
allow the SSH
connection to be torn down easily, without the user having to track
individual PIDs. The OpenSSH tunneling features
require that startup use a
restricted root login, because this is needed
to establish the SSH VPN [3]. This
is because the user
that connects to the SSHD server must
have permission to create a tunnel interface.
On the whole, it is possible to use
the tunneling features to set up an SSH
Layer 3 VPN linking various users across a Wide
Area Network. The security features of
OpenSSH provide a protected tunnel
using its authentication
schemes. OpenSSH is also
important in ensuring that the WAN
is protected by providing
data encryption services for the
data carried to the private
networks from public networks [12].
Even though there are several vulnerabilities related
to OpenSSH in a multi-layer design, OpenSSH
can be used successfully
in a Wide Area Network with
multiple users. This has contributed
positively to the area of network security, in
particular for WANs with numerous users.
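
A Layer 3 SSH tunnel of the kind described above is usually started with the OpenSSH client's `-w` option, which requests `tun` devices on both ends and requires `PermitTunnel` to be enabled in the server's `sshd_config`. The sketch below, offered under those assumptions, only assembles the command lines; the server name and control-socket path are placeholders, and assigning IP addresses to the tunnel interfaces is left out.

```python
def tunnel_commands(server, control_path, local_tun=0, remote_tun=0):
    """Return (start, stop) argv lists for an SSH tun-device tunnel.

    A control socket (-M/-S) lets the tunnel be closed later with `-O exit`,
    so the caller does not need to remember the PID of the background process.
    """
    target = f"root@{server}"  # creating a tun interface needs sufficient privileges
    start = ["ssh", "-f", "-N", "-M", "-S", control_path,
             "-w", f"{local_tun}:{remote_tun}", target]
    stop = ["ssh", "-S", control_path, "-O", "exit", target]
    return start, stop


# Placeholder server name and socket path.
start_cmd, stop_cmd = tunnel_commands("vpn.example.com", "/tmp/ssh-vpn.sock")
```

Using the control socket for teardown reflects the point made in the text that the user does not have to track process IDs to close the link.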

4.1 GoHop: Personal VPN to defend from censorship

GoHop obfuscates its traffic patterns with the
aim of evading both censorship systems and
traffic analysis [16]. GoHop transforms
the packet lengths and
transmission behavior of its traffic, and is faster
than Tor in the presence of traffic analysis. In addition,
because of its simplicity, GoHop performs
better than other censorship circumvention
tools [5]. As a personal VPN,
GoHop can be trusted because the client and the server
are in a trusted relationship and belong to the same
entity.
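
The idea of altering packet lengths can be illustrated with a simple random-padding scheme. The sketch below is not GoHop's actual wire format, which the text does not describe; the 2-byte length prefix, the `max_pad` parameter, and both function names are assumptions made purely for illustration.

```python
import secrets
import struct


def pad_packet(payload: bytes, max_pad: int = 255) -> bytes:
    """Prefix the true length, then append a random amount of random padding."""
    pad_len = secrets.randbelow(max_pad + 1)  # 0..max_pad bytes of padding
    return struct.pack("!H", len(payload)) + payload + secrets.token_bytes(pad_len)


def unpad_packet(packet: bytes) -> bytes:
    """Recover the original payload using the 2-byte length prefix."""
    (length,) = struct.unpack("!H", packet[:2])
    return packet[2:2 + length]
```

In a real tunnel the padded packet would be encrypted before transmission, so that an observer sees only the randomized total length and not the length prefix.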

4.2 LISP-based instant VPN services 

LISP is being standardized in the IETF, and it
separates the IP address function into routing
locators (RLOCs) and endpoint identifiers
(EIDs) [2]. This identifier/locator separation is well suited
to instant virtual private network
(VPN) services because it provides a range
of tunneling features. LISP is very
useful in designing a framework that creates
multiple, logically separated topologies that
run over one common infrastructure and
resource pool [15]. In this regard,
the concept of Virtual Routing and Forwarding
(VRF) is used to create multiple
instances of segmentation at the VPN level. LISP
defines two namespaces, RLOC and
EID, and virtualization can be applied
in either of them. In addition, the LISP
mapping system can be used to map
virtualized EID namespaces to RLOC namespaces.
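
The mapping of virtualized EID spaces to RLOCs can be pictured as a lookup table keyed by a per-VPN instance identifier. The following Python sketch is a simplified illustration under that assumption; the `MapTable` class, its method names, and the addresses are hypothetical and do not model the LISP control-plane messages themselves.

```python
import ipaddress


class MapTable:
    """Toy EID-to-RLOC mapping table, partitioned per VPN instance."""

    def __init__(self):
        self._entries = {}  # instance_id -> list of (EID prefix, RLOC)

    def register(self, instance_id, eid_prefix, rloc):
        net = ipaddress.ip_network(eid_prefix)
        self._entries.setdefault(instance_id, []).append(
            (net, ipaddress.ip_address(rloc)))

    def lookup(self, instance_id, eid):
        """Return the RLOC for the longest matching EID prefix, or None."""
        addr = ipaddress.ip_address(eid)
        matches = [(net, rloc)
                   for net, rloc in self._entries.get(instance_id, [])
                   if addr in net]
        if not matches:
            return None
        return str(max(matches, key=lambda m: m[0].prefixlen)[1])
```

Because each instance has its own table, two VPNs can reuse the same EID prefix (for example `10.0.0.0/8`) and still resolve to different RLOCs over the shared infrastructure.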

5. Conclusions and Recommendations 

It is critically important to have
a model that meets the needs of the user. Use of
the Internet has brought a host of concerns and
problems that have to be handled with great
care, and it is here that the VPN
becomes relevant. Ever
since people started to use technology to
communicate, there has been a clear
division between public and private
networks. A public network, for instance
the public telephone system or the
Internet, is a large collection of
unrelated peers that exchange information
more or less freely with each other.
The people with access to a
public network may or may not have anything
in common.
Any given individual on such a
network may communicate with only a small
fraction of its potential users. A
private network is composed of computers owned
by a single organization that share information
specifically with each other. The parties
involved are assured that they are the only ones
using the network, and that
information sent between them is
seen only by others in the group.
Multi-agent architecture for distributed IT GRC platform
Abstract — The IT GRC platform is a solution based on
the paradigm of distributed systems and on multi-agent systems
(MAS) in its different parts, namely the user interface, the static
and dynamic configuration of the organization management
profiles, the choice of the best repository, and the handling of
processes. It takes advantage of the autonomy and learning aspects
of MAS as well as their high-level communication and
coordination. However, these technological components are
difficult to manipulate, or users lack the necessary skills to use 
them correctly. In this situation, the modeling of a communication 
architecture is necessary, in order to adapt the functionalities of 
the platform to the needs of the users. To help achieve these goals, 
it is necessary to develop a functional and intelligent 
communication architecture, adaptable and able to provide a 
support framework, allowing access to system functionalities 
regardless of physical and time constraints. 

Index Terms —Multi-agent systems, IT GRC, frameworks, best 
practices, communication system, distributed system, information 
system. 

I. Introduction 

Faced with a competitive market for IT solutions,
information systems are made up of heterogeneous
components, with increasingly complex information flows
and processes. Stakeholder decisions in the area of IT
governance have become sensitive, hence the need for adequate
IT governance tools.

The modeling of an IT GRC platform must take into account
several parameters. First, the system must ensure and evaluate
the alignment of the business objectives of the company with
the IS objectives and strategy. Then, it must choose the best
reference framework for the Governance, Risk and Compliance
of Information Systems. This repository of good practices
aligns the strategy of information systems through a set of
guidelines that serve as benchmarks for business processes.

The platform for governance, risk management and information
systems compliance that we proposed in a previous work
communicates with the stakeholders of the IS, namely the
Director of Information Systems (DSI) and the Business
Managers of each department. It has an intelligent semantic
engine that translates the objectives expressed by its
users into a language understood by the most widespread
reference frameworks (COBIT, ITIL, PMBOK, ISO 27001,
ISO 27002, ISO 27005, MEHARI, and EBIOS). In order to
implement the appropriate IT GRC processing, a multi-criteria
decision system is integrated, making it possible to choose the
best repository for a given request. Our IT GRC platform
encapsulates each repository in an expert system for end-to-end
evaluation, interactively with the user. These
repositories are updated each time a new version
appears.
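
The multi-criteria choice of a repository can be illustrated with a weighted-sum ranking. The text does not state which decision method the platform uses, so the weighted sum, the criteria names, and all numeric scores below are assumptions invented for illustration only.

```python
def choose_framework(scores, weights):
    """Return the framework with the highest weighted-sum score."""
    def total(name):
        return sum(weights[c] * scores[name][c] for c in weights)
    return max(scores, key=total)


# Hypothetical scores on a 0-10 scale; these are not measured values.
scores = {
    "COBIT": {"governance": 9, "risk": 6, "compliance": 7},
    "ISO 27005": {"governance": 4, "risk": 9, "compliance": 6},
}

# A governance-oriented request and a risk-oriented request.
governance_request = {"governance": 0.5, "risk": 0.3, "compliance": 0.2}
risk_request = {"governance": 0.1, "risk": 0.8, "compliance": 0.1}

best_for_governance = choose_framework(scores, governance_request)
best_for_risk = choose_framework(scores, risk_request)
```

With these illustrative numbers, a governance-weighted request selects one repository and a risk-weighted request selects the other, which is the behavior expected of a decision layer that adapts to each request.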

The IT GRC platform is composed of several systems that 
lead to good governance. Each of these systems is responsible 
for performing specific tasks: 

• EAS-Strategic: aligns the business needs of the company 
with IT objectives and IT processes; 

• EAS-Decision: receives the IT objectives expressed by 
EAS-Strategic. It is able to choose for a request from the 
strategic layer the best reference for IT governance, risk 
management and compliance; 

• EAS-Processing: encapsulates each IT GRC repository in an
intelligent and autonomous system that deploys the actions and
implements all the recommendations of the best repository
chosen by EAS-Decision in an interactive way, which makes it
possible to manage the desired IT processes end to end and
generate action plans.

However, these technological components are difficult to 
manipulate, or users lack the necessary skills to use them 
correctly. In this situation, the modeling of a communication 
architecture is necessary, in order to adapt the functionalities of 
the platform to the needs of the users. To help achieve these 
goals, it is necessary to develop a functional and intelligent 
communication architecture, adaptable and able to provide a 
support framework, allowing access to system functionalities 
regardless of physical and time constraints. 
A functional architecture defines the logical and physical 
structure of the components that make up a system, and the 
interactions between these components [1], [2] and [3]. If we 
focus on intelligent and distributed architectures, the main 
paradigm to consider is the multi-agent system. 
EAS-COM is a new architecture focused on product
development based on multi-agent systems. It integrates this
technology to facilitate the development of a flexible
distributed system by taking advantage of the characteristics of
agent interaction to model the functional system.

H. IGUER, S. FARIS and H. MEDROMI are with LRI Laboratory, Systems
architecture team, Hassan II University, Casablanca, Morocco
(haiar.iguer@gmail.com), (sophiafaris1989@gmail.com),
(hmedromi@yahoo.fr).

S. ELHASNAOUI is with LPRI Laboratory, EMSI, Casablanca, Morocco, and
with LRI Laboratory, Systems architecture team, Hassan II University,
Casablanca, Morocco (elhasnaoui.soukaina@gmail.com).

International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 2, February 2018
https://sites.google.com/site/ijcsis/
ISSN 1947-5500


This paper is organized as follows. We start with the state of the
art of communication systems. Then, two versions of the architecture
of our communication system are presented, following
the two existing communication models: communication by
information sharing and communication by message sending.
Finally, we present a new hybrid approach to workflow
management within the IT GRC platform based on multi-agent
systems, which combines the two communication models.
This approach makes it possible to manage the communication
between the different layers of the platform while ensuring the
follow-up of the steps, from the expression of an IT service (an IT
objective in terms of IT processes) to the handling of the IT
processes involved and the proposal of action plans to put in
place.
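
The two communication models named above can be contrasted in a short Python sketch. The `Blackboard` and `Agent` classes, the agent names, and the message contents are hypothetical; they are meant only to show the difference between sharing information through a common store and sending messages to a mailbox, not to reproduce the EAS-COM implementation.

```python
import queue


class Blackboard:
    """Communication by information sharing: agents read and write a common store."""

    def __init__(self):
        self._facts = {}

    def write(self, key, value):
        self._facts[key] = value

    def read(self, key):
        return self._facts.get(key)


class Agent:
    """Communication by message sending: each agent owns a mailbox."""

    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()

    def send(self, other, content):
        other.mailbox.put((self.name, content))

    def receive(self):
        return self.mailbox.get_nowait()


# Hybrid use: the payload is shared on the blackboard, the notification is a message.
board = Blackboard()
strategic, decision = Agent("EAS-Strategic"), Agent("EAS-Decision")
board.write("it_objective", "manage IT risks")
strategic.send(decision, "it_objective")      # tell the receiver which key to read
sender, key = decision.receive()
objective = board.read(key)
```

Combining the two, as in the last lines, hints at what a hybrid approach can look like: messages carry lightweight notifications while the shared store holds the data being worked on.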

II. State of the art of Communication Systems 

A. Motivation 

To support the construction of distributed systems, we are 
witnessing a constant evolution of models that make extensive 
use of software engineering, model analysis, etc., to facilitate 
the implementation of these systems. 

However, the implementation of distributed systems is not 
tied to a specific communication system that manages the 
workflows between these applications. Nevertheless, the 
technological choice of the most appropriate communication 
system is a fundamental task to ensure the integration of the 
components and the scalability of the system. 

A large body of research in different
areas of workflow management can be found in the
literature. Each research area has its own specifications and
requirements for managing a processing request. Information
flow management is important for the use and sharing of
resources in a distributed system to establish meaningful
communication. Many studies have combined communication
management with agent technology, which involves supporting
different approaches to the implementation of this
technology.

A communication system aims at interconnecting
the systems of the distributed platform in order to
effectively address the problems faced by companies in terms
of reusability, interoperability and reduced coupling between
the different systems that make up their information
systems [4]. Thus, ensuring interoperable communication
between different distributed applications is the main problem
for distributed systems. Workflow management techniques
could meet these requirements. In addition, workflow
management is implemented as an interconnection of less
complicated tasks [5]. The concept of workflow has been
studied in many areas of research.

B. Context and methodology 

Infrastructure based on multi-agent systems generally
facilitates autonomous communication between distributed
systems [6]. Recently, various improvements have been
made to communication based on multi-agent systems.
These improvements can overcome some interoperability
issues, but the different types of technologies and development
models used can themselves cause interoperability issues. As with the
software architecture of any system, and as its most important part, a
communication system should satisfy certain
requirements such as availability, autonomy, recovery from failure,
and guaranteed transmission. On the other hand, existing work
based on agent technology has some gaps in communication
systems [7]. Building a flexible, self-contained communication
system would be a good initiative to manage workflows in the
IT GRC distributed platform. In addition, the use of agent
technology can be useful for facilitating communication in a
distributed platform, as it provides significant attributes such as
adaptation, interactivity, multi-protocol support and
lightweight implementation [8].

C. Research work related to communication management in 
distributed systems 

Many research studies can be found in the literature
concerning the improvement of communication systems within
distributed platforms, based on agents and multi-agent systems.
Most of these works address specific problems and some are
too general and not specific to a particular problem [9]. Several
research studies have been conducted to solve the various
communication problems such as synchronization, reliability, and
the communication language of the agents. As Table 1 shows,
there are twelve relevant research studies based on agent
technology that relate to communication in distributed systems.
The table summarizes and compares different approaches based
on attributes related to communication. We can divide these
works into two categories: synchronous and asynchronous
communication. The works were selected from several types of
technologies. The attributes used in the evaluation were
essentially chosen from generic specifications and requirements
for communication in distributed systems. The meaning of
each of the attributes included in this comparative study is as
follows:

• Type of communication: the communication style,
synchronous or asynchronous.

• Availability: the availability of the application to
respond to processing requests at any time.

• Autonomy: the level of intelligence of the system
in managing and implementing queries.

• Message type: the type of message used in
communication.

• Scalability: the ability to manage the expansion of the
distributed system.

No. | Communication system within distributed platforms                | Communication type | Availability | Autonomy | Message type     | Scalability
1   | An Agent Platform for Reliable Asynchronous Distributed (1)      | Asynchronous       | Medium       | High     | ACL              | Low
2   | Agent-Based Middleware for Web Service Dynamic (2)               | Synchronous        | Low          | High     | WSDL             | Medium
3   | XML-based Mobile Agents (3)                                      | Synchronous        | Low          | High     | XML              | Low
4   | An Agent-Based Distributed Smart Machine (4)                     | Synchronous        | Low          | High     | KQML / ACL       | Medium
5   | An Agent XML based Information Integration Platform (5)          | Synchronous        | Low          | High     | SOAP             | Low
6   | A Cross-Platform Agent-based Implementation (6)                  | Synchronous        | Low          | High     | ACL              | High
7   | Communication System among Heterogeneous Multi-Agent System (7)  | Synchronous        | Low          | High     | ACL              | Medium
8   | FACL (Form-based ACL) (8)                                        | Synchronous        | Medium       | High     | Form-based (ACL) | Low
9   | ACL Based Agent Communications in Plant Automation (9)           | Asynchronous       | Medium       | High     | ACL              | Medium
10  | Multi-agent Systems for Distributed environment (10)             | Synchronous        | Low          | High     | ACL / KQML       | Medium
11  | SOA Compliant FIPA Agent Communication Language (11)             | Synchronous        | Low          | High     | ACL              | Medium
12  | An Agent-Based Distributed Information Systems Architecture (12) | Synchronous        | Low          | High     | ACL              | High

Table 1. Scientific work related to communication management

The research work cited in Table 1 is as follows: (1) [10], (2) [11],
(3) [12], (4) [13], (5) [14], (6) [15], (7) [16], (8) [17], (9) [18],
(10) [19], (11) [20], (12) [21].


The comparative study in Table 1 shows that
systems (1) and (9) rely on asynchronous communication while
the other systems are based on synchronous communication.
We emphasize that for systems (6) and (12) the criterion of
scalability is taken into account, while the availability criterion
is not. All these architectures ensure the autonomy of their
systems. However, these architectures have certain limitations
concerning the environment in which they are integrated. In
fact, they do not specify the parameters of information flows,
nor the learning and adaptation of information.
In addition, this comparative study shows that most of these
architectures do not deal with the security aspect of data
exchange, nor with access control to distributed systems. They
are limited to checking the user's credentials through a
login and a password.

It should be noted that while the presented architectures provide 
access to distributed systems based on ADMs, they have 
limitations in terms of distribution, data adaptation, and the 
scalability aspect that is not addressed in most architectures 
studied. 

We propose a new architecture to overcome the different 
limitations encountered. This architecture is distributed, 
intelligent and able to meet the requirements of governance, 
risk management and compliance of information systems in 
terms of distribution, adaptation of the data provided to the user 
according to different constraints. It ensures the scalability of 
the IT GRC platform to ensure the exchange of data with added 
processing systems 


III. Basic Architecture of the EAS-COM Communication System


EAS-COM (see Figure 1, where EAS-COM is represented by the
cross-section of the platform) is a communication system that
facilitates the integration of distributed systems within the IT
GRC platform. This system is dynamic, flexible, robust, adaptable
to each user's request, scalable, and easy to use and maintain.
Moreover, the architecture is extensible: any desired processing
system can be integrated, without dependence on a specific
programming language. The systems integrated in the IT GRC platform
follow an integrated communication protocol. Another important
feature is that, thanks to the capabilities of the agents, the
systems developed can use learning techniques to exploit decisions
that were made previously and recorded in knowledge bases. EAS-COM
offers a new perspective, in which multi-agent systems and web
services are integrated to meet communication needs, leverage their
strengths, and avoid their weaknesses.

Fig 1. Positioning the EAS-COM Communication System in the IT GRC
Platform


A. Introduction to EAS-COM subsystems 

Designing a communication system based on multi-agent systems
requires answering the following questions:

• Which functions should be modeled as agents? 

• How to decompose the communication system to be 
supported by a multi-agent system approach? 

• What types of interactions should exist between agents? 

• What skills and resources do agents need? 

• What priorities should be considered in workflow 
management studies, and how are these priorities to be 
addressed by agents? 

In order to solve the communication problem within the IT 
GRC platform, we break down the EAS-COM system into 
subsystems. Each subsystem is responsible for performing a 
specific communication task. 

There is a strong link between the choice of agents and the
purposes for which they are designed. We therefore study the
workflows between the IT GRC platform components according to their
significance. With this in mind, we carry out the following main
tasks:

1) Categorize the IT services received from the
strategic layer;

2) Request and receive decision processing
(interaction with the decision layer) with respect to
the best repositories;

3) Manage processing systems (sending IT services to
process and receiving processing results), taking into
account the quality of their processing and
performance.

Each task can be assigned to an agent or group of agents.

We call the multi-agent system assigned to the categorization
of IT services (interaction with the strategic layer)
"Strategic-COM". It contains the agents responsible for task (1).

We call the multi-agent system assigned to communication with the
decision-making layer "Decision-COM". It contains the agents
responsible for task (2).

We call the multi-agent system assigned to the processing
management of IT services (interaction with the processing
layer) "Processing-COM". The agents in this multi-agent system are
responsible for task (3). They must interact in order to achieve
this goal and to generate a representation of the action plan from
the processing results.



Fig 2. EAS-COM architecture subsystems 


B. Strategic-COM

The Strategic-COM subsystem provides communication
with the strategic layer represented by the EAS-STRATEGIC
system. The latter expresses the strategic needs of the user in
terms of IT services by defining the IT processes that must be
managed by the IT GRC platform. The resulting IT services are
redirected to EAS-COM, and more precisely to the Strategic-COM
subsystem.
Recall that Strategic-COM categorizes the IT processes included in
the requested IT service. Categorization is the association of each
IT process with one or more best practice repositories that can
define the management activities of the requested IT process.

The following diagram explains the procedure for categorizing
an IT service received from the strategic layer.

Fig 3. Procedure for Categorizing an IT Service by Strategic-COM

The IT service categorization procedure is as follows:

> First, the IT service received is broken down
according to the IT processes it includes.

> Each IT process is associated with one or more best practice
repositories according to the discipline to which it
belongs (IT governance, IT risk management, IT
compliance).

> The elements of the matrix are then built. Each element
has the following form: {Proc i, (Ref 1, Ref 2, ..., Ref n)}

> The elements are subsequently grouped in order to build
the final matrix in the form: { {Proc a, (Ref i, Ref j, ...,
Ref n)}, {Proc b, (Ref i, Ref j, ..., Ref n)}, ..., {Proc z,
(Ref i, Ref j, ..., Ref n)} }. This matrix represents the
categorized IT service, ready to be processed by the second
EAS-COM subsystem.

We have defined three types of agents: Collector Agent,
Manager Agent, and Constructor Agent.

The Collector Agent and Constructor Agent focus primarily on
organizational tasks, while the Manager Agent performs
processing tasks. The main objective of the Manager Agent is to
guarantee the categorization of the processes of an IT service by
assigning them one or more repositories that can manage them.
It communicates the categorization result of each IT process to
the Constructor Agent, which, in turn, assembles this information
to send the categorized IT service to be consumed by the decision
layer (EAS-Decision).




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 
International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 2, February 2018 


Fig 4. Strategic-COM Agents


Collector Agent

The Collector Agent performs an organizational task. It is
responsible for receiving IT services from the strategic layer
(EAS-Strategic). To achieve its goal, it checks the structure of
the received web services and classifies them according to the
date of their creation by the user (the creation date is specified
in every IT service). At the end of its processing, it transfers
the IT services to the Manager Agent.

Manager Agent

The Manager Agent is the heart of Strategic-COM. It categorizes IT
services by associating each IT process with one or more
appropriate standards for its implementation. At the end of the
processing, it merges the elements of the matrix, which constitute
the categorized IT service, in the form {Process IT,
{ref1, ref2, ..., refn}}. This result is transferred to the
Constructor Agent.
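
As an illustration only, the categorization performed by the Manager
Agent can be sketched in Python as a lookup of each IT process in a
knowledge base, followed by grouping the resulting elements into the
matrix. The process names, the repository names other than COBIT, and
the function names are assumptions made for this sketch; they are not
part of the EAS-COM specification.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: IT process -> repositories able to manage it.
# These entries are illustrative placeholders, not data from the platform.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "Manage incidents": ["ITIL", "COBIT"],
    "Manage risk": ["ISO 27005", "COBIT"],
}

# One matrix element has the form {Proc i, (Ref 1, ..., Ref n)}.
MatrixElement = Tuple[str, Tuple[str, ...]]


def categorize_process(process: str) -> MatrixElement:
    """Associate one IT process with the repositories known to manage it."""
    refs = KNOWLEDGE_BASE.get(process, [])
    return (process, tuple(refs))


def categorize_service(processes: List[str]) -> List[MatrixElement]:
    """Group the elements of every process into the final matrix."""
    return [categorize_process(p) for p in processes]


matrix = categorize_service(["Manage incidents", "Manage risk"])
for process, refs in matrix:
    print(process, "->", ", ".join(refs))
```

A process that is unknown to the knowledge base yields an empty
repository tuple in this sketch; how the real agent handles that case
is not stated in the text.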

The Manager Agent has a knowledge base: it knows the repositories
that can manage the IT processes issued by the strategic layer.
Initially, this knowledge depends on the mapping of COBIT processes
to the other repositories. This mapping list is extended as the IT
GRC platform learns. The knowledge of the Manager Agent can be
divided into two types: knowledge of the IT GRC disciplines (IT
governance, risk management and compliance management) and their
associated repositories, and knowledge of the categorizations
that were done previously.

When categorizing IT processes, the Manager Agent's knowledge of
the best practice repositories is enriched and updated, which
allows the agent to associate a suitable set of best practice
repositories in its future processing. This enrichment of
knowledge is possible because the Manager Agent constantly
monitors the changes in its environment and exchanges
information with the Updater.


Constructor Agent

The objective of this agent is to provide a readable
representation of the IT service it handles, while preserving as
much as possible the parameters of the IT service (the user who
created the IT service, its creation date, the priority of its
IT processes, etc.). To achieve this goal, it retrieves the result
of the categorization of the IT processes provided by the Manager
Agent and rebuilds the final matrix that represents the
categorized IT service, which is sent to the decision-making
layer (EAS-Decision) in the form of a web service.


Fig 5. Distribution of Strategic-COM Agents according to their tasks


C. Decision-COM

The Decision-COM subsystem provides communication with
the decision layer represented by the EAS-Decision system.
This communication consists of sending the categorized IT
service to the decision-making layer. Once the decision is made
regarding the best repository to associate with each of the IT
processes included in the IT service, Decision-COM receives the
result of the decision, represented by the decided IT service. The
latter must have the following format: {(Proc a, ref i),
(Proc b, ref j), ..., (Proc z, ref n)}.

We define one agent, the DD Agent, which performs an
organizational task. Its main objective is to ensure the
communication of the IT service with the decision layer. It
receives the categorized IT service from the Constructor Agent and
translates it into a web service in order to send it to the
decision layer, then waits to receive the result of the decision.
Once the result is received, it is transferred to the
Processing-COM subsystem for processing.
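
The exchange handled by the DD Agent can be sketched as follows. This
is a minimal sketch that assumes JSON as the wire format of the web
service and uses hypothetical field names ("process", "repositories",
"repository"); the text does not specify the actual message format.

```python
import json
from typing import List, Sequence, Tuple

Categorized = Sequence[Tuple[str, Sequence[str]]]


def to_payload(categorized: Categorized) -> str:
    """Serialize the categorized IT service for the decision layer."""
    return json.dumps(
        [{"process": proc, "repositories": list(refs)} for proc, refs in categorized]
    )


def parse_decision(raw: str, categorized: Categorized) -> List[Tuple[str, str]]:
    """Read the decided IT service: one (process, repository) pair per process.

    Each chosen repository must be one of the candidates that were sent.
    """
    candidates = {proc: set(refs) for proc, refs in categorized}
    decided = []
    for item in json.loads(raw):
        proc, ref = item["process"], item["repository"]
        if ref not in candidates.get(proc, set()):
            raise ValueError(f"{ref!r} was not a candidate for {proc!r}")
        decided.append((proc, ref))
    return decided


categorized = [("Proc a", ("COBIT", "ITIL")), ("Proc b", ("COBIT",))]
payload = to_payload(categorized)
# A reply that the decision layer might send back (made up for this example).
reply = '[{"process": "Proc a", "repository": "ITIL"}, {"process": "Proc b", "repository": "COBIT"}]'
print(parse_decision(reply, categorized))
```

The validation step is a design choice of the sketch: it rejects a
decision that names a repository which was never proposed for that
process.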
Fig 6. Decision-COM Agents


D. Processing-COM 

The Processing-COM subsystem provides communication
with the processing layer represented by the EAS-PROCESSING
systems. EAS-PROCESSING-type processing systems handle IT
processes, following the recommendations dictated by the
repository chosen by the decision-making system (EAS-Decision),
in order to generate the action plans to be put in place to meet
the strategic needs of the organizations that use the IT GRC
platform. The processing requests are triggered by the EAS-COM
communication system, and more precisely by the Processing-COM
subsystem.

We define four types of agents: ComIn Agent, Admin Agent,
Directory Agent, and ComOut Agent.

The ComIn and ComOut agents focus on organizational tasks, while
the Admin and Directory agents perform processing tasks. The main
goal of the Admin Agent is to ensure that the processing of the
processes included in the IT service is allocated to the correct
processing systems. It interacts with the other Processing-COM
agents in order to achieve an optimal choice of processing
systems. During this interaction, the Admin Agent works with
the ComOut Agent to resolve multiple processing requests addressed
to the same processing system. To achieve all these goals, agents
act according to their knowledge and skills.

ComIn Agent

The ComIn Agent is a communicating agent. It receives the decided
IT service from the Decision-COM subsystem and transfers it to the
Admin Agent, which determines the processing systems that can
manage the IT processes included in the decided IT service.

Admin Agent

The Admin Agent invokes the processing system that is best
placed to complete the IT service processing and generate the
action plans to be put in place.


If several systems can handle the requested task, the Admin Agent
is able to select the optimal one. This choice of processing
system depends on the system's performance, its number of
executions, its availability, and so on. This information is
stored in the agent's knowledge base, which it uses to resolve
conflicting situations. With each choice made, it communicates
with the ComOut Agent and determines the best system to
trigger.

During processing, the Admin Agent stores important
information, such as the useful results it has obtained from
previous processing and the changes in its environment. When the
Admin Agent chooses a processing system, it evaluates the
results of its action and updates its knowledge. It then proceeds
to the choice of the system that will manage the IT process.
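
One possible reading of this selection step is sketched below. The
record fields and the ranking rule (highest performance first, then
the most executions, among available systems only) are assumptions
chosen for illustration; the text names the criteria but does not say
how they are weighted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SystemRecord:
    """What the knowledge base is assumed to hold about one processing system."""
    name: str
    performance: float  # assumed score in [0, 1], higher is better
    executions: int     # number of past executions
    available: bool


def choose_system(records: List[SystemRecord]) -> Optional[SystemRecord]:
    """Pick the best available system, or None if every system is unavailable."""
    candidates = [r for r in records if r.available]
    if not candidates:
        return None
    # Rank by performance, then by experience (number of executions).
    return max(candidates, key=lambda r: (r.performance, r.executions))


records = [
    SystemRecord("system-A", performance=0.95, executions=40, available=False),
    SystemRecord("system-B", performance=0.80, executions=12, available=True),
    SystemRecord("system-C", performance=0.80, executions=30, available=True),
]
best = choose_system(records)
print(best.name if best else "no system available")
```

Here system-A is skipped because it is unavailable, and system-C wins
the tie on performance because it has more recorded executions.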

Directory Agent

The Directory Agent records the processing reports of the
systems, as well as information about them (system performance,
number of executions, etc.). This information is taken into
consideration by the Admin Agent to choose the most relevant
processing system.

ComOut Agent

Notifying and triggering the processing systems that can handle
all the processes in an IT service is a complex task that can add
processing time and therefore slow down the execution of
requests. We therefore need to partition the processing request
across the processing systems. Control of the notification of each
system is assigned to a specific agent (the ComOut Agent). In this
step, we propose a new approach in which the triggering of the
processing of an IT service's processes can be partitioned. Our
idea is to trigger, together, the set of processing systems chosen
to implement the processes of the same IT service. For this
trigger, the ComOut Agent receives the list of processing systems
to be notified. This list should contain the information about
these systems, namely the name of the system, its description, and
the IP address of the server on which the processing system is
running.

This method provides simultaneous processing of all the processes
included in the IT service. However, there may be situations
where multiple processing requests are not allowed, such as
requests to process several processes on the same processing
system, which could significantly reduce that system's
performance. In this case, the Admin Agent asks the ComOut Agent
to check the status of the affected system, and the ComOut Agent
informs it that the system is busy and cannot accept further
requests until it finishes. The Admin Agent must then decide
either to choose another processing system that could handle the
request or to wait until the system becomes available. The
importance of the Processing-COM subsystem of EAS-COM therefore
lies in the acceleration it gives to the triggering process, and
hence to the processing of the processes of a requested IT
service.
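
The partitioned triggering described above can be sketched with a
thread pool. This is only a sketch under stated assumptions: the
`trigger` function is a placeholder for the real notification (for
example a web-service call), and deferring the second request to a
busy system is one of the two options left to the Admin Agent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Assignment = Tuple[str, str]  # (processing system, IT process)


def trigger(system: str, process: str) -> str:
    """Placeholder for notifying one processing system (hypothetical)."""
    return f"{system} handled {process}"


def dispatch(assignments: List[Assignment]) -> Tuple[List[str], List[Assignment]]:
    """Trigger at most one process per system in parallel; defer the rest."""
    busy = set()
    now: List[Assignment] = []
    deferred: List[Assignment] = []
    for system, process in assignments:
        if system in busy:
            # The system already has a request in this round: report it as busy.
            deferred.append((system, process))
        else:
            busy.add(system)
            now.append((system, process))
    with ThreadPoolExecutor(max_workers=max(1, len(now))) as pool:
        results = list(pool.map(lambda a: trigger(*a), now))
    return results, deferred


results, deferred = dispatch([("S1", "P1"), ("S2", "P2"), ("S1", "P3")])
print(results)
print("deferred:", deferred)
```

The deferred list is what the Admin Agent would then either reassign
to another system or retry once the busy system becomes available.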
Fig 7. Distribution of Processing-COM Agents according to their tasks


IV. Conclusion 

In this paper, we presented the main features of EAS-COM,
the subsystems that make it up, and the agents that compose each
of these subsystems. We have not specified the mode of
communication to be used between the system's agents, or
between EAS-COM and the other distributed systems of the IT
GRC platform.

In future work, we will propose architectures for the EAS-COM
system based on two modes of communication: communication
by information sharing and communication by message passing.
Abstract

Traffic analysis is of great importance when it comes to securing a network. This analysis can be classified into
different levels, one of the most interesting of which is Deep Packet Inspection (DPI). DPI is a very effective way of
monitoring the network, since it performs traffic control over most of the OSI model's layers (from L3 to L7). Regular
Expressions (RegExp), on the other hand, are used in computer science to build a search pattern from a group of
characters. This technique can be combined with a series of mathematical algorithms to help the user quickly find the
search pattern within a text and even replace it with another value.

In this paper, we aim to prove that the use of Regular Expressions is much more productive and effective when used for
creating the matching rules needed in DPI. We design and test Regular Expression rules and compare them against the
conventional methods. In addition, we have created a case study on detecting the EternalBlue and DoublePulsar threats,
in order to point out the practical and realistic value of our proposal.

I. Introduction 

With the rapid growth of Information Systems, human activities have become more and more dependent on the Internet and
Network Systems, which makes an increase in threats inevitable, as criminals target unprotected or poorly secured
systems and networks in order to take advantage of their victims in various ways. Attackers use various attacks and
techniques aimed at disrupting the availability, confidentiality and integrity of the systems.

Attackers nowadays have developed techniques and tools that are more powerful, stealthy and persistent, which help
them intrude into their victims' Information Systems more easily, with the aim of stealing valuable information and
intelligence, or of controlling the systems remotely and causing disruption and destruction of the system under
attack. To avoid such incidents, organizations, enterprises and institutions choose to deploy Intrusion Detection and
Prevention Systems (IDPS) in order to protect their networks more proactively and sufficiently, since a simple
anti-virus or firewall is not adequate to protect their assets. According to the SANS Institute, an Intrusion
Detection System (IDS) is a system dedicated to securing the rest of the systems within a particular network [1], [2].
The need to deploy IDPS is made mandatory by the fact that attackers use more advanced attack methods to corrupt the
functionality of a system and are able to bypass the firewall or anti-virus systems that are usually integral elements
of the information system. No one can question the need for IDPS in protecting modern Information Systems and
Networks, but they still cannot be seen as a silver bullet against all threats, which are becoming more and more
advanced. Due to a number of technological limitations, Intrusion Detection and Prevention Systems can be vulnerable
to attacks as well, or fail to detect some threats efficiently. For that reason, many studies and different
implementations are developed and tested in order to minimize the weaknesses of IDPS. Regular Expressions (RegExp) are
used in computer science to build a search pattern from a group of characters. This technique can be combined with a
series of mathematical algorithms to help the user quickly find the search pattern within a text and even replace it
with another value.

In this paper, we aim to prove that the use of Regular Expressions is much more productive and effective when used
for creating detection rules in DPI. We construct and test our Regular Expression rules and compare them against the
conventional method. In addition, we have created a case study on detecting the EternalBlue and DoublePulsar threats,
in order to point out the practical and realistic value of our method.

II. Related Work 

There is a great volume of related work, each with a different approach to how Intrusion Detection Systems can use
Deep Packet Inspection in combination with other techniques in order to improve their efficiency.
Dr. V.M. Thakare et al. [3] survey the existing techniques that combine Deep Packet Inspection with Regular Expression
patterns. They start by pointing out the positive aspects of Deep Packet Inspection compared to the traditional types
of packet inspection used over the years, and how Regular Expressions can join forces with DPI. The first technology
they examine is LaFA (Lookahead Finite Automata), a software-based Regular Expression approach. According to them,
LaFA has the major advantage that, due to its properties, it does not tie up a large amount of the system's memory.
Next, they examine a hardware-based solution called Ternary Content Addressable Memory (TCAM), which is widely used in
network devices. This approach is ideal for cases where a high Regular Expression matching rate is important. The
third approach is Stride Finite Automata (StriFA). Like LaFA, StriFA is a software-based answer for implementing a
Deep Packet Inspection and Regular Expression pair. The advantages of this technique are the high speed it can offer
and its low memory requirements. The fourth technique is Compact Deterministic Finite Automata (Compact DFA). The
authors point out that Compact DFA is just an altered, compressed version of the classic Deterministic Finite Automata
(DFA) approach. One of the most popular algorithms using this approach is Aho-Corasick (the AC algorithm), which is
also the one used by Snort by default. Last but not least, the authors examine the Extended Character Set DFA
technique, whose purpose is to minimize the memory needs of the original DFA.

C. Amuthavalli et al. [4] look into two more traditional approaches that can be combined with Deep Packet Inspection.
The first is Deterministic Finite Automata (DFA) and the second is Non-Deterministic Finite Automata (NFA). They also
give an extended description of the theory of Intrusion Detection Systems, their evolution over the years, their
functions, and the types into which they are divided. Next, they describe the different kinds of packet inspection,
focusing mainly on Deep Packet Inspection and the methods that make use of it. In the fourth section of their paper,
they comment on the different types of network attacks by categorizing them into two major classes, active and passive
attacks. Finally, they close their paper with the fifth section, which examines the theory and features of the Regular
Expression technique, the mathematical expressions of DFA and NFA, and how Deep Packet Inspection and Regular
Expressions are used together to detect specific patterns, giving a brief example of detecting Yahoo traffic with the
help of the Application Layer Packet Classifier, Linux L7.

G. Douglas et al. [5] examine the combination of Deep Packet Inspection and Regular Expressions from their own
perspective. More specifically, their paper focuses on the conversion of Deep Packet Inspection rule sets into a
Regular Expression-like meta format. For their demonstration, they developed a tool called Snort2Reg Translator. With
the help of this translator, they were able to convert Snort rules into Regular Expressions without affecting the
accuracy of the analysis. As they state, although there were other attempts to transform rules into other types of
expressions, none of them focused on Regular Expression patterns, which, according to them, are highly beneficial for
Intrusion Detection Systems due to their syntax. The authors were also able to define the format of the regular
expression structure by taking into account the type of protocol that the converted rule was intended for. However,
the method has some limitations, since there might be parts of the Snort rule content that cannot be transformed into
Regular Expressions efficiently. In the fifth section of the paper, the authors present in detail the results of
transforming a couple of Snort rules into Regular Expressions.

Yi Wang [6], in her paper, investigates how Regular Expression matching can benefit Network Intrusion Detection
Systems. The main goal of the paper is to use an Improved Grouping Algorithm (IGA) to improve the Yu algorithm. At the
beginning of the paper, Wang states the basic principles of Regular Expressions, as well as the mathematical
expressions for Non-Deterministic Finite Automata and Deterministic Finite Automata. Next, she explains the idea
behind the Regular Expression Grouping Algorithm that is used to improve the algorithm. For the analysis, Wang
performed the testing with different parameters using three different engines (L7-Filter, Snort and Bro) and comments
on the results.

III. Intrusion Detection Systems and Regular Expressions

An IDS is in charge of monitoring information systems and the network traffic between them in order to analyze and
detect abnormal or hostile traffic that is generated either outside or within the network by misusing it or trying to
attack it [7].

From the definition above, we can see that the main goal of an Intrusion Detection System is to recognize potential
incidents. When such an incident is identified, the IDS usually performs a series of functions such as:

• Making local records regarding the detected incident.

• Generating notifications for the security administrators about uncommon observed incidents. These
notifications are also known as alerts.

• Providing detailed and comprehensive reports related to the incidents that caught the attention of the IDS.

No matter how well established an IDS is, there is no way of completely eliminating false positives and false
negatives. A false positive is the state in which no attack or threat occurs against the network, yet the IDS falsely
generates an alert. A false negative is the state in which a threat really occurs and the IDS fails to detect it; this
case is considered very dangerous for a network, since part or all of the network might be compromised without the
administrators being notified in order to take care of the issue. Hence, the accuracy of the IDS depends on the
overall rate of falsely and correctly generated alerts. More specifically, the precision of an IDS can be described
with the formula:

Precision = TP / (TP + FP)

where TP represents the number of True Positives and FP the number of False Positives.
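
As a small worked example of the formula, the following Python sketch computes the precision from alert counts. The
counts are made up for illustration, and returning 0.0 when no alert was raised at all is a convention chosen for this
sketch, not something the formula itself defines.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP), where both counts are numbers of alerts."""
    if tp < 0 or fp < 0:
        raise ValueError("alert counts cannot be negative")
    total_alerts = tp + fp
    if total_alerts == 0:
        # No alerts were generated, so the ratio is undefined; use 0.0 by convention.
        return 0.0
    return tp / total_alerts


# Illustrative numbers: 90 alerts matched real attacks, 10 were false alarms.
print(precision(90, 10))
```

With these illustrative counts, 90 of the 100 generated alerts were correct, so the precision is 0.9.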

There are various types of IDPS, but we focus on the Network-based IDPS (NIDPS). This type of IDPS is responsible for
monitoring the traffic and devices of a particular network. It is also in charge of analyzing the network layer (e.g.
IPv6, ICMP, IGMP), the transport layer (e.g. TCP, UDP) and the application layer (e.g. DNS, HTTP, SMTP) in order to
recognize any malicious behavior within the network. Analysis can also be performed at the hardware layer; however,
this capability is limited.
A. Packet Inspection Levels

There are diverse technologies related to packet inspection in networks. These technologies can be divided into
three categories [8], [9]:

• Shallow Packet Inspection (SPI)

• Medium Packet Inspection (MPI)

• Deep Packet Inspection (DPI)

In Shallow Packet Inspection (SPI), we examine only the headers of the packet. These are located at the start of the
data, contain information such as the IP addresses of the sender and the receiver, and are usually used for routing
purposes. The parties involved in the communication can therefore retain a level of anonymity, since we do not
examine the payload of the packet and have no further knowledge about the packet's contents. This type of inspection
is the simplest of the three.

MPI is used when we want to prevent certain activities, such as downloading content from specific web sites. It can
also be used when there is a need to give priority to particular types of packets. This is achieved by examining the
application layer of the packet.

By employing DPI, we are able to detect the exact source and content of every packet within the monitored network.
DPI inspects each of the network layers, which makes it ideal for performing detailed network analysis, and it can be
implemented using either software or hardware devices.

B. Regular Expressions 

A Regular Expression (RegExp) pattern can contain both regular characters, that is, characters that have a literal
meaning, and metacharacters, characters that have a special meaning different from their literal one [10]. The content
of a RegExp pattern can range from being identical to a traditional Exact Match pattern to more complicated
expressions. Although RegExp can mimic an Exact Match pattern, this usage does not unveil the full potential of the
technique. For this reason, regular characters and metacharacters are commonly combined to create more advanced search
patterns. The basic syntax and quantifiers for regular expressions can be found in Appendix A.
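
To make the difference between an Exact Match pattern and a RegExp pattern concrete, the following Python sketch
applies both to a few payloads. The payloads and both patterns are invented for this illustration and are not taken
from the rules developed in this paper.

```python
import re

# Hypothetical payloads, as a DPI engine might see them.
payloads = [
    b"GET /index.html HTTP/1.1",
    b"get /admin.php HTTP/1.0",
    b"POST /login HTTP/1.1",
]

# Exact match: only this literal byte sequence is found.
exact = b"GET /admin.php"

# RegExp: literal characters combined with metacharacters.
#   (?i)        case-insensitive matching
#   ^(GET|POST) the request starts with one of two methods
#   \s+         one or more whitespace characters
#   /\S*\.php\b any path that ends in ".php"
pattern = re.compile(rb"(?i)^(GET|POST)\s+/\S*\.php\b")

exact_hits = [p for p in payloads if exact in p]
regex_hits = [p for p in payloads if pattern.search(p)]

print("exact match hits:", exact_hits)
print("regexp hits:     ", regex_hits)
```

The exact pattern misses the second payload because of the lowercase method, while the single regular expression
covers that variation and any other PHP path requested with GET or POST.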

The use of RegExp for DPI has been studied thoroughly from a systematic perspective, and sufficient application and
technical background has been provided on how RegExp could be used in DPI [11], [12], [13]. Our approach aims to
fully prove this statement.

IV. The Methodology and Tools

Security Onion (SO) is an Ubuntu-based distribution, used for purposes related to intrusion detection, network security 
and monitoring [14]. For this reason, it comes with pre-installed a number of suites and tools that are capable of 
detecting and protecting the network from abnormal and suspicious behaviors 

Snort is an open source NIDPS, included in SO. Snort offers the ability to write our own detection rules. Snort rules 
are expressions written in a specific syntax on a single or multiple lines [14]. A Snort rule is divided into two different 
parts. The first one is the rule header that contains the action to be performed, the protocol that the rule is related with, 
the source and destination IP addresses and port. At this section, the rule, also, contains the “direction” of the action 
(e.g. from an external network to our own or vice versa). The rest of the components within a Snort rule, are part of 
the second part of the rule named rule option. In this section, we can find more particular information related with the 
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packet that will trigger the action specified on the rule header section. An indicative Snort rule can be seen on the 

Fig-1 

Source Address 



Rule Header Rule Option 

Figure 1. Snort Rule Structure 
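
To make the header/option split concrete, here is a minimal Python sketch that separates a single-line rule into the two parts described above. The function `split_snort_rule` and the example rule (including its message and sid) are hypothetical and are not taken from the study; the parser is deliberately naive and assumes that no option value contains a semicolon.

```python
def split_snort_rule(rule: str):
    """Split a single-line Snort rule into header fields and a list of options."""
    header_text, _, rest = rule.partition("(")
    options_text = rest.rsplit(")", 1)[0]

    # Rule header: action, protocol, source, direction operator, destination.
    action, proto, src_ip, src_port, direction, dst_ip, dst_port = header_text.split()
    header = {
        "action": action, "protocol": proto,
        "src_ip": src_ip, "src_port": src_port,
        "direction": direction,
        "dst_ip": dst_ip, "dst_port": dst_port,
    }

    # Rule options: "keyword:value;" pairs (naive split, see assumption above).
    options = []
    for item in options_text.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(":")
            options.append((key.strip(), value.strip()))
    return header, options


# Hypothetical rule, written only to exercise the parser.
example = ('alert tcp $EXTERNAL_NET any -> $HOME_NET 445 '
           '(msg:"Example SMB rule"; flow:to_server,established; '
           'content:"|FF|SMB"; depth:5; sid:1000001; rev:1;)')

header, options = split_snort_rule(example)
print(header)
print(options)
```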

The elements of the Rule Options section fall into four categories. General options give information about the rule 
but do not affect detection. Payload options search for specific data within the packet’s payload. Non-payload options 
search for data outside the packet’s payload. Finally, post-detection options define what happens after the rule’s 
action has been triggered. We present here some of the rule options used in this study. 

• The flow option plays a role similar to the direction operator, as it describes the flow of the traffic. When this 
option is followed by the keyword “established”, the rule applies only to traffic that belongs to an established 
TCP session. 

• Depth modifies the preceding content option and specifies how far into the payload the analyzer searches for that 
content. For example, with depth: 5 the analyzer looks for the content only within the first 5 bytes of the payload 
(counted from the offset, if one is set) and then stops searching. 

• The Offset option specifies where to start looking for a pattern within a packet. For example, if Offset has a 
value of 10, Snort starts searching for the specified content after the first 10 bytes of the payload. 

• The Distance option is the relative counterpart of Offset: it tells Snort how many bytes to ignore after the 
previous match before it starts searching for the next specified content. 

• The Within option is closely related to Distance. It specifies the maximum number of bytes after the previous 
match within which the next content must be found. 

• Finally, there is the fast_pattern option, which marks the content that Snort's fast pattern matcher should use to 
speed up the matching process. In other words, if the content that carries the fast_pattern option is not found, 
Snort does not examine the rule any further. 
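
The interaction of these four modifiers can be sketched in a few lines of Python. This is a simplified model written for this explanation, not Snort's implementation: the helper names `find_content` and `find_relative` and the sample payload are assumptions, and only the byte-window arithmetic is modelled.

```python
def find_content(payload: bytes, pattern: bytes, offset: int = 0, depth=None) -> int:
    """Absolute search: start at `offset` and look only at the next `depth` bytes."""
    end = len(payload) if depth is None else min(len(payload), offset + depth)
    return payload.find(pattern, offset, end)


def find_relative(payload: bytes, pattern: bytes, prev_end: int,
                  distance: int = 0, within=None) -> int:
    """Relative search: skip `distance` bytes after the previous match, then
    require the pattern to lie inside the next `within` bytes."""
    start = prev_end + distance
    end = len(payload) if within is None else min(len(payload), start + within)
    return payload.find(pattern, start, end)


# Made-up payload: 4-byte header, a 5-byte marker, 7 filler bytes, 2-byte tail.
marker = b"\xffSMB\x72"
payload = b"\x00\x00\x00\x10" + marker + b"\x00" * 7 + b"\xaa\xbb"

first = find_content(payload, marker, offset=4, depth=5)      # window is bytes 4..8
prev_end = first + len(marker)
second = find_relative(payload, b"\xaa\xbb", prev_end, distance=7, within=2)

too_shallow = find_content(payload, marker, offset=4, depth=4)  # marker does not fit
too_late = find_content(payload, marker, offset=5)              # search starts past it

print(first, second, too_shallow, too_late)
```

A result of -1 means that the content was not found inside the permitted window, which is the case in which the corresponding rule option would fail.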

Sguil is open source software, included in SO, dedicated to monitoring the network. Sguil works alongside Snort to 
support the analysis of the alerts triggered when suspicious behavior occurs [15], [16]. 

Wireshark is one of the best-known free, open source programs for packet analysis. It is written in C/C++ and was 
originally developed by Gerald Combs [15]. Wireshark is very similar to Tcpdump, but it offers a Graphical User 
Interface (GUI). Wireshark not only performs live network traffic analysis but can also record the monitored traffic 
and save it in the form of PCAP files. In addition, it offers a great variety of filters for searching within packets in 
more detail. Like Snort, Wireshark uses Deep Packet Inspection to provide detailed information about network traffic. 

Metasploit is a framework that aids the discovery of insecure and vulnerable systems and is mainly used as part of 
Penetration Testing [16]. It provides users with a great variety of exploits that can be configured to target a specific 
system. Metasploit then checks whether the victim is vulnerable to the selected exploit. Once the target is found to be 
vulnerable, the attacker can generate a payload that targets the discovered vulnerability in order to gain access. 
Metasploit can also add encoding to the payloads so that they pass unnoticed by the IDPS. 

Nmap is another tool that we use, as a network mapper. Its main purpose is the discovery of hosts and of the services 
running within a specific computer network [17]. To do so, Nmap analyses the responses of the hosts after it has sent 
them specially crafted packets. Nmap offers other features beyond displaying the hosts and services within a network. 
One that we use is its support for scripts, which can provide more detailed information about vulnerabilities linked to 
a specific protocol or service used by the targeted system, and about whether they are applicable. It also offers other 
useful functionality, such as providing the attacker with the operating system of the victim, its MAC address and its 
open ports that may be vulnerable to attacks. There is also a GUI version of Nmap called Zenmap. 

Wine is a free, open source compatibility layer, often described as a Windows emulator, used to run Windows 
applications on Linux and other operating systems. It lets us use the strengths of the host operating system while 
still being able to run programs developed exclusively for Windows [18]. A useful property of Wine, which we use in 
our implementation, is the ability to remotely access Windows applications. 

V. The Attack Virtual Environment 

Our study combines RegEx with Deep Packet Inspection (DPI) for matching patterns. We will show that by using 
RegEx with DPI, instead of an exact match pattern, we are able to improve the efficiency of the IDPS. 

In our scenario we use three systems, each with a different role. The first is the victim’s system, which runs the 
Windows 7 operating system. This system is deliberately left vulnerable to the pair of threats EternalBlue and 
DoublePulsar [22], [23]. More specifically, EternalBlue is an exploit for an SMB protocol vulnerability that can be 
used to open a backdoor on outdated Windows systems. DoublePulsar is a Trojan horse that can be used on Windows 
operating systems to open backdoors and perform a variety of unauthorized actions, such as injecting DLL files or 
executing shellcode that can be malicious for the system. The threat is capable of deleting itself after the attacker 
has reached his goals. These are the threats used in the WannaCry incident, which gained great publicity and proved 
once again that even simple ransomware can strike one of the most valuable aspects of cybersecurity, the availability 
of systems and information. Before moving to the detailed methodology, it is useful to examine in more detail the 
idea behind the EternalBlue and DoublePulsar threats. 

EternalBlue became famous as a Server Message Block (SMB) protocol vulnerability. Although it had been known for a 
long time to the U.S. National Security Agency (NSA), it became public after being leaked by the notorious hacker 
group Shadow Brokers in 2017. After the exploit became publicly known, it was used to deliver a number of attacks 
worldwide, among them the WannaCry and NotPetya cyberattacks. Last but not least, it was stated that EternalBlue 
played a part in spreading the banking trojan Retefe. EternalBlue can be found in the Common Vulnerabilities and 
Exposures database under the ID CVE-2017-0144. According to that entry, the exploit targets Windows versions from 
Windows XP onwards, as well as Windows Server 2008, 2012 and 2016. 

DoublePulsar, on the other hand, is a backdoor tool developed by the NSA and leaked by Shadow Brokers. The malware 
is a Windows-based threat that served as a cyber spying tool. As already mentioned, the malware works in combination 
with the EternalBlue exploit and is delivered by abusing TCP port 445, which is also the default port for the SMB 
protocol. Apart from the SMB protocol, DoublePulsar can use another exploit, related to the Remote Desktop Protocol 
(RDP), to spread itself, this time over TCP and UDP port 3389. The payload that DoublePulsar uses can be considerably 
large. The size of the payload stems from the need of the malware to recognize the victim’s system architecture in 
order to perform a successful attack. Although the payloads for the two architectures (x86 and x64) are almost 
identical in structure, using the wrong payload results in an unsuccessful attack. Later we demonstrate how, by 
sniffing and recording the traffic with Wireshark and Tcpdump, we can monitor the abnormal behavior the malware 
causes on the network. The great danger of this malware is that it offers a high level of flexibility to the attackers 
once it is installed on the system. It has also been reported that it is very difficult to remove from infected 
machines. It is capable of performing three commands, which are the source of the high level of control offered to 
the intruders. These commands are ping, kill and exec, with the last being especially critical, as it can be used to 
execute malicious code in order to infect the target even further. 

The EternalBlue and DoublePulsar attacks are used together, the first for creating a backdoor and the second for 
helping to inject DLL files by means of payloads. To exploit the SMB vulnerability and perform a DoublePulsar attack, 
we need a second system in the role of the attacker, running the Kali Linux operating system. A third machine, acting 
as IDPS (in our case SO with Snort), monitors the network traffic and records it with the help of Wireshark. At the 
same time, the Sguil interface under SO makes us aware of the attack by displaying the alerts triggered by Snort 
rules. Later we evaluate the detection rules related to the SMB vulnerability using the PCAP file recorded by 
Wireshark during the attack. For this we write a Snort rule using a RegExp, capable of helping the IDPS perform 
better pattern matching. In the next step we evaluate this rule by replaying the PCAP file and checking the reaction 
of Sguil with the new rule. A good knowledge and understanding of the Cyber Exploitation Life Cycle is also needed in 
order to perform the attack and recognize its “footprint” within the PCAP file. Finally, we present the results and 
our evaluation of the outcomes of the method, and we show that proper application of RegEx pattern matching actually 
improves the performance of the IDPS. 


A brief description of the attack environment 

There are three different virtual machines. The first system is a Windows 7 virtual machine, named IE8-Win7, 
which plays the role of the victim. The characteristics of the three machines are the following: 

| | Victim | Attacker | IDS |
|---|---|---|---|
| Operating System | Windows 7 Enterprise Service Pack 1 | Kali Linux 4.6.0 (Kali-Rolling) | Security Onion 14.04.5.4 |
| Architecture | 32 bits | 64 bits | 64 bits |
| Base Memory (RAM) | 1 GB | 6 GB | 4 GB |
| Processors | 1 CPU | 2 CPUs | 2 CPUs |
| Network Adapters | Host-only Adapter (eth1) | NAT (eth0) and Host-only Adapter (eth1) | NAT (eth0) and Host-only Adapter (eth1) |


The victim machine’s operating system is not protected by any form of security software (e.g. firewall, anti-virus). 
We had to adjust the registry settings to enable the SMB protocol [24] by running the command shown in Fig. 2 in 
PowerShell as administrator. The DWORD value set to 1 indicates that the SMB protocol is enabled. If the value were 
set to 0, the SMB protocol would be disabled for this machine. 


Administrator: Command Prompt - powershell 

Microsoft Windows [Version 6.1.7601] 
Copyright (c) 2009 Microsoft Corporation. All rights reserved. 

C:\Windows\system32>powershell 
Windows PowerShell 
Copyright (C) 2009 Microsoft Corporation. All rights reserved. 

PS C:\Windows\system32> Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" SMB -Type DWORD -Value 1 -Force 
PS C:\Windows\system32> 


Figure 2: Command used for enabling SMB protocol on Windows 7 

The second virtual machine is the attacker. For this purpose, we use a Kali Linux virtual machine, named KaliLinux. 

In contrast to the victim, the attacker’s system uses two network interfaces: eth0, set as NAT, which serves as the 
management interface, and eth1, set as Host-only Adapter, which serves as the sniffing interface. Concerning software 
needs, we make sure that the systems are fully updated and that the Eternalblue and DoublePulsar exploits for 
Metasploit [25], as well as Wine, are installed, updated and configured properly. 

Finally, we have the IDS, which is the sniffing system. Table I lists the systems used, their roles and their IP 
addresses. 

TABLE I. The IPs of the environment 

| System | Role | IP Address |
|---|---|---|
| IE8 Windows 7 | Victim | 192.168.2.101 |
| KaliLinux | Attacker | 192.168.2.103/192.168.2.104 |
| Security Onion Lab | Sniffer-IDS | 192.168.2.102 |


As noted, the attacker system has two different IP addresses. This was not necessary and does not affect the results 
of this research. 

VI. Implementing the Attack 

We use the Metasploit Framework to exploit the target system through the vulnerable SMB protocol and the pair of 
EternalBlue and DoublePulsar threats. 

The steps of the attack: 

We set SO to record and monitor the traffic so that it shows the alerts generated during the attack. To do so, we use 
Wireshark on both eth0 and eth1. We also launch Sguil, which monitors the network interfaces and displays all alerts 
as they arise, together with the corresponding information. 

Install the Wine software on our Kali Linux machine, which already provides a variety of pre-installed tools. 



:~# sudo apt-get install wine32 
Reading package lists... Done 
Building dependency tree 
Reading state information... Done 
wine32:i386 is already the newest version (2.0.3-1). 
The following packages were automatically installed and are no longer required: 
  libass5 libchromaprint0 libebur128-1 libfftw3-single3 libmimic0 
  libopencv-calib3d2.4v5 libopencv-contrib2.4v5 libopencv-core2.4v5 
  libopencv-features2d2.4v5 libopencv-flann2.4v5 libopencv-highgui2.4-deb0 
  libopencv-imgproc2.4v5 libopencv-legacy2.4v5 libopencv-ml2.4v5 
  libopencv-objdetect2.4v5 libopencv-video2.4v5 libschroedinger-1.0-0 
  libva-drm1 libva-x11-1 libva1 libwildmidi1 libx265-79 
Use 'sudo apt autoremove' to remove them. 
0 upgraded, 0 newly installed, 0 to remove and 1850 not upgraded. 
:~# 


Scan the network to discover our target and the services it uses. More specifically, since we have already chosen 
the type of attack that we are going to use, we can use more advanced options in order to get more precise 
information about the potential targets. 


:~# nmap -p445 --script smb-vuln* 192.168.2.0/24 


Figure 15: Nmap Command 

Nmap scans the 192.168.2.0/24 network range, searching for open 445 ports that can be linked with the SMB 
vulnerability. With the --script option we tell the Nmap scanner to report any SMB vulnerabilities that apply to the 
victim whenever it finds an open SMB port. The output on our terminal, once Nmap has finished scanning the whole 
network range that we defined, is shown in Fig. 16. 



Nmap scan report for 192.168.2.101 
Host is up (0.00030s latency). 
PORT    STATE SERVICE 
445/tcp open  microsoft-ds 
MAC Address: 08:00:27:3F:03:BC (Oracle VirtualBox virtual NIC) 

Host script results: 
| smb-vuln-cve2009-3103: 
|   VULNERABLE: 
|   SMBv2 exploit (CVE-2009-3103, Microsoft Security Advisory 975497) 
|     State: VULNERABLE 
|     IDs:  CVE:CVE-2009-3103 
|       Array index error in the SMBv2 protocol implementation in srv2.sys in Microsoft Windows Vista Gold, SP1, and SP2, 
|       Windows Server 2008 Gold and SP2, and Windows 7 RC allows remote attackers to execute arbitrary code or cause a 
|       denial of service (system crash) via an & (ampersand) character in a Process ID High header field in a NEGOTIATE 
|       PROTOCOL REQUEST packet, which triggers an attempted dereference of an out-of-bounds memory location, 
|       aka "SMBv2 Negotiation Vulnerability." 
| 
|     Disclosure date: 2009-09-08 
|     References: 
|       https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-3103 
|_      http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-3103 
|_smb-vuln-ms10-054: false 
|_smb-vuln-ms10-061: NT_STATUS_ACCESS_DENIED 


Figure 16: Nmap Outcome. 


Nmap came up with a potential target that uses the IP address 192.168.2.101. Port 445 on this machine is open, and we 
can infer that our target is a Windows virtual machine (the MAC address that we get back shows that it is a virtual 
machine). Also, since it found an open SMB port, Nmap next reports the vulnerabilities related to the SMB protocol and 
whether they apply to the target. In our case, the vulnerability that applies to this specific target is CVE-2009-3103, 
a vulnerability found in SMB version 2. There are two more vulnerabilities, MS10-054 and MS10-061; however, Nmap 
indicates that they cannot be used in order to compromise the machine. 
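
The same reading of the scan can be automated. The following Python sketch is only an illustration under stated assumptions: the text in `NMAP_OUTPUT` is a trimmed copy of the output in Fig. 16, the function `summarize` is hypothetical, and the parsing relies on the human-readable layout shown there rather than on Nmap's structured output formats.

```python
import re

# Trimmed copy of the scan output reported in Fig. 16.
NMAP_OUTPUT = """\
Nmap scan report for 192.168.2.101
445/tcp open  microsoft-ds
MAC Address: 08:00:27:3F:03:BC (Oracle VirtualBox virtual NIC)
| smb-vuln-cve2009-3103:
|   VULNERABLE:
|     State: VULNERABLE
|_smb-vuln-ms10-054: false
|_smb-vuln-ms10-061: NT_STATUS_ACCESS_DENIED
"""


def summarize(output: str) -> dict:
    """Extract host, MAC vendor and the scripts that report State: VULNERABLE."""
    host = re.search(r"Nmap scan report for (\S+)", output).group(1)
    mac = re.search(r"MAC Address: ([0-9A-F:]{17}) \((.+)\)", output)

    vulnerable, current = [], None
    for line in output.splitlines():
        script = re.match(r"\|[ _]*(smb-vuln-[\w-]+):", line)
        if script:
            current = script.group(1)          # a new script section starts
        elif current and "State: VULNERABLE" in line:
            vulnerable.append(current)         # verdict belongs to that script

    return {
        "host": host,
        "mac": mac.group(1),
        "is_virtualbox": "VirtualBox" in mac.group(2),
        "vulnerable_scripts": vulnerable,
    }


summary = summarize(NMAP_OUTPUT)
print(summary)
```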

At this point, we have gathered some useful information about the victim; however, we also need a clearer view of 
the Windows operating system that the victim runs. We use Nmap once again, with the following command, which helps 
us determine the operating system on host 192.168.2.101 (Fig. 17). 


:~# nmap -O 192.168.2.101 


Figure 17: Nmap Command for OS detection 

The outcome in Fig 18: 


Running: Microsoft Windows 7|2008|8.1 
OS CPE: cpe:/o:microsoft:windows_7::- cpe:/o:microsoft:windows_7::sp1 cpe:/o:microsoft:windows_server_2008::sp1 cpe:/o:microsoft:windows_server_2008:r2 cpe:/o:microsoft:windows_8 cpe:/o:microsoft:windows_8.1 
OS details: Microsoft Windows 7 SP0 - SP1, Windows Server 2008 SP1, Windows Server 2008 R2, Windows 8, or Windows 8.1 Update 1 

Figure 18: OS Detection Outcome 

Nmap informs us that the targeted operating system is either Windows 7/8 or Windows Server 2008, helping us to 
exclude other Windows versions that are equally affected by this vulnerability. By using Nmap we were able to 
determine that the target at IP address 192.168.2.101 has an open port 445, is vulnerable to CVE-2009-3103 and is 
running either Windows 7/8 or Windows Server 2008. All of this will help us to launch our attack. 

To launch the attack, we need to start the Metasploit Framework. Before that, however, we have to start the database 
that the Metasploit Framework relies on, by using the “service postgresql start” command as shown below. 



Figure 19: Starting Metasploit Framework. 

As already mentioned, we use the exploits EternalBlue and DoublePulsar. In Fig. 20 we see the exploit and the 
available options we can use in order to prepare our payload. 

Figure 20: Options available for EternalBlue and DoublePulsar. 

First we need to set the paths for both EternalBlue and DoublePulsar, as seen in Fig. 21. 


msf exploit( ) > set DOUBLEPULSARPATH /root/Desktop/Eternalblue-Doublepulsar-Metasploit/deps 
DOUBLEPULSARPATH => /root/Desktop/Eternalblue-Doublepulsar-Metasploit/deps 
msf exploit( ) > set ETERNALBLUEPATH /root/Desktop/Eternalblue-Doublepulsar-Metasploit/deps 
ETERNALBLUEPATH => /root/Desktop/Eternalblue-Doublepulsar-Metasploit/deps 
msf exploit( ) > 

Figure 21: Setting the path for EternalBlue and DoublePulsar. 

If we take a look at the paths that we have set, within the deps folder we find many dll and exe files that are used 
in order to launch the attack. These exe files, as well as the “PROCESSINJECT” option that we use later on, are the 
reasons why we had to install the Wine Emulator at the beginning of the process. 

Next, we set the rest of the needed options. We need to define the name of the process to be injected on the victim, 
under the PROCESSINJECT option. We also have to set the target’s IP address and the target’s system architecture. 
Under the option EXPLOIT TARGET, we define the operating system of the victim’s machine. Last but not least, we set 
the PAYLOAD that will be used and the IP address of the attacker system in order to create the reverse shell that 
gives us remote access to the victim’s side. 

Figure 22: Setting the rest of the parameters needed. 

[*] Started reverse TCP handler on 192.168.2.103:4444 
[*] 192.168.2.101:445 - Generating Eternalblue XML data 
[*] 192.168.2.101:445 - Generating Doublepulsar XML data 
[*] 192.168.2.101:445 - Generating payload DLL for Doublepulsar 
[*] 192.168.2.101:445 - Writing DLL in /root/.wine/drive_c/eternal11.dll 
[+] 192.168.2.101:445 - Launching Eternalblue... 
[+] 192.168.2.101:445 - Pwned! Eternalblue success! 
[*] 192.168.2.101:445 - Launching Doublepulsar... 
[*] Sending stage (957999 bytes) to 192.168.2.101 
[*] Meterpreter session 1 opened (192.168.2.103:4444 -> 192.168.2.101:49158) at 2018-01-21 15:04:10 -0500 
[+] 192.168.2.101:445 - Remote code executed... 3... 2... 1... 


Figure 23: Outcome of successful Exploit. 

After we have set the needed parameters as shown in Figure 22, we are ready to exploit the target by typing exploit 
in our Metasploit terminal. Our terminal then displays the output shown in Figure 23. 

We can see the steps performed by the Metasploit Framework with the settings we have made. First of all, XML data is 
generated for both exploits. After that, Metasploit creates a DLL file that is used as DoublePulsar’s payload. Finally, 
Metasploit launches both EternalBlue and DoublePulsar and notifies us of the outcome of the exploitation. In our case, 
the exploitation was successful; we can run the commands shown in Fig. 24 to confirm it, and we can also start a shell 
to access our victim’s machine remotely. 

meterpreter > getuid 
Server username: IE8Win7\IEUser 
meterpreter > shell 
Process 900 created. 
Channel 1 created. 
Microsoft Windows [Version 6.1.7601] 
Copyright (c) 2009 Microsoft Corporation. All rights reserved. 

C:\Windows\system32> 


Figure 24: Starting the shell to gain remote access. 



Figure 25: Processes running on the Victim’s side. 

Now, if we take a look at the victim’s processes, we encounter a familiar element. To display the processes that are 
running on the victim’s machine, we open a command line, type PowerShell and then get-process. Our terminal displays 
output like that in Fig. 25. Among the processes running on the Windows 7 system, we can see a process named 
“explorer”. This is the process we injected into using the Metasploit Framework when we set the “PROCESSINJECT” 
option in the previous steps. 

Once the attack has succeeded, we can check on the sniffing machine the output of Sguil, which was monitoring the 
interfaces the whole time. The alerts that were triggered during the process are shown in Fig. 26. 

[Figure 26 is a screenshot of the Sguil console, taken on 2017-12-28. The legible alerts, raised for traffic from 
192.168.2.104 to 192.168.2.101, are: 

ET NETBIOS Microsoft SRV2.SYS SMB Negotiate ProcessID Function Table Dereference 
GPL NETBIOS SMB-DS IPC$ share access 
GPL NETBIOS SMB-DS winreg create tree attempt 
GPL NETBIOS SMB-DS IPC$ unicode share access 
ET SHELLCODE Possible Call with No Offset TCP Shellcode 
ET POLICY PE EXE or DLL Windows file download 
GPL SHELLCODE x86 inc ebx NOOP 

The rule displayed for the first alert reads: 

alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"ET NETBIOS Microsoft SRV2.SYS SMB Negotiate ProcessID Function Table Dereference"; flow:to_server,established; content:"|FF 53 4d 42 72|"; offset:4; depth:5; content:!"|00 00|"; distance:7; within:2; reference:url,www.exploit-db.com/exploits/14674/; reference:url,www.microsoft.com/technet/security/bulletin/ms09-050.mspx; reference:cve,2009-3103; classtype:attempted-user; sid:2012063; rev:1;)] 

Figure 26: Alerts generated by the attack launched. 

As we can see, there are several alerts related to the SMB protocol; however, none of them gives us any specific 
information linking the alerts to the EternalBlue or DoublePulsar exploits. For that reason, we search the PCAP file 
recorded by Wireshark in order to collect more information, identify the attack’s pattern and construct our own Snort 
rule using RegEx. 

Knowing that EternalBlue and DoublePulsar take advantage of the SMB protocol, we filter the PCAP file in order to 
examine it more easily. After taking a careful look at the filtered output, we notice a suspicious repetition of 
malformed SMB packets. By examining the Reassembled TCP data in Fig. 27, we can observe a pattern that we can take 
advantage of in order to write the Snort rule. 

[Figure 27 is a screenshot of Wireshark with the display filter smb applied to eternalblue.pcap. The packet list shows 
packets 481, 488, 491, 494 and 497, all sent from 192.168.2.103 to 192.168.2.101 over SMB with a length of 1514 bytes 
and the description "Trans2 Secondary Request [Malformed Packet]". The details pane for frame 478 shows TCP source 
port 43124 and destination port 445, three reassembled TCP segments totalling 4153 bytes (#476: 1448, #477: 1448, 
#478: 1257) and the expert info "Malformed Packet (Exception occurred)". In the byte pane, the reassembled data starts 
with a four-byte NetBIOS Session Service header followed by ff 53 4d 42 33 00 00 00 00 18 07 c0, and further on the 
ASCII column shows a long run of letters, digits, "+" and "/". The status bar reports 1945 packets, of which 187 
(9.6%) are displayed.] 

Figure 27: Apply SMB filter on Wireshark. 
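
The start of the reassembled data in Fig. 27 can also be checked programmatically. The Python sketch below is an illustration only: the function `is_trans2_secondary` is hypothetical, and `first_row` reproduces the first sixteen bytes of the hex pane, with the leading byte read as 00 (the figure is not fully legible there, and the check does not depend on it). Command code 0x33 is the SMB1 Trans2 Secondary request that Wireshark names in the packet list.

```python
SMB1_MAGIC = b"\xffSMB"
TRANS2_SECONDARY = 0x33  # SMB1 command code for Trans2 Secondary requests


def is_trans2_secondary(message: bytes) -> bool:
    """True if a NetBIOS-framed message carries an SMB1 header with command 0x33."""
    if len(message) < 9:
        return False
    # Bytes 0-3: NetBIOS Session Service header, bytes 4-7: SMB1 magic,
    # byte 8: SMB command code.
    return message[4:8] == SMB1_MAGIC and message[8] == TRANS2_SECONDARY


# First row of the byte pane in Fig. 27 (leading byte assumed to be 00).
first_row = bytes.fromhex("00 00 10 35 ff 53 4d 42 33 00 00 00 00 18 07 c0")

# The three bytes after the message type give the length of what follows.
declared_length = int.from_bytes(first_row[1:4], "big")

print(is_trans2_secondary(first_row))
print(declared_length + 4)
```

The declared length plus the four header bytes equals the 4153 bytes that Wireshark reports for the reassembled data, which supports this reading of the figure.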

The first of these elements is a long sequence of 25 bytes, which includes the string SMB and serves as the first 
content of the rule. Right after it there is a sequence of bytes seen in every malformed SMB packet that can be useful 
for detecting this behavior. Lastly, among the elements common to the malformed packets, we identify a third useful 
sequence of bytes for constructing the rule (Fig. 28). 

Figure 28: Elements used for assembling Snort rule. 

There is a fourth pattern at the end of each packet; however, it is not as clear-cut as the other patterns, in the 
sense that we have to find a way to use it efficiently. This is where Regular Expressions come in handy. Even though 
the content of the fourth element is not the same in all the malformed packets, we can use a Regular Expression that 
matches any combination of upper-case letters, lower-case letters and digits from 0 to 9, together with a quantifier, 
since in each malformed packet such characters are repeated many times. Fig. 28 shows a sample of the elements that 
guided us in assembling this Snort rule. 

The Regular Expression will be: 

/^[a-zA-Z0-9+/]{500,}/R 

^: Starts the match from the beginning of the string. 

[ ]: Matches any one of the characters contained in the square brackets. 

a-z: Matches every lower-case letter. 

A-Z: Matches every upper-case letter. 

0-9: Matches every digit. 

+ and /: Match the + or / symbol. 

{500,}: Matches the preceding bracket expression 500 or more times. 

/.../R: R is a Snort Regular Expression flag that makes the match relative to the end of the previous content match, 
similar to using the option distance:0. 
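
The behavior of this expression can be tried outside Snort. The Python sketch below is an approximation under stated assumptions: Python's `re` module stands in for Snort's PCRE engine, the `/R` flag is emulated by starting the anchored match at the offset where a previous content match ended, and the marker bytes and filler text are invented test data rather than bytes from the capture.

```python
import re

# Same character class and quantifier as the expression above.
LONG_RUN = re.compile(rb"[a-zA-Z0-9+/]{500,}")


def matches_after(payload: bytes, prev_match_end: int) -> bool:
    """Emulate ^ combined with /R: the run must begin exactly where the
    previous content match ended."""
    # Pattern.match(data, pos) only succeeds if the match starts at pos.
    return LONG_RUN.match(payload, prev_match_end) is not None


marker = b"\x00\x00\x00\x10"       # stands in for the preceding content match
chunk = b"K4wvxJGuQgA7t"           # 13 made-up characters from the allowed set

long_enough = marker + chunk * 50           # 650 allowed characters
too_short = marker + chunk * 10             # only 130 allowed characters
not_adjacent = marker + b"\x00" + chunk * 50  # run starts one byte too late

print(matches_after(long_enough, len(marker)))
print(matches_after(too_short, len(marker)))
print(matches_after(not_adjacent, len(marker)))
```

Only the first payload satisfies both the length requirement and the relative anchoring, which is what keeps the expression from firing on short or unrelated runs of text.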

Having found the main elements of our rule, we can now write it. Since it is a “private” rule, we have to place it in 
the local.rules file found in the /etc/nsm/rules directory. To edit the file we need root rights, and for that reason 
we use the command sudo -s. 



Figure 29: Snort rule for EternalBlue. 

Our rule, written using the Snort rule options, is shown in Fig. 29. Before putting it to the test, we have to update 
the rule set used by Snort in order for our rule to take effect. This can be done with the command “sudo rule-update”. 
After Barnyard2 and Snort have been successfully restarted, we can use the command in Fig. 30 to replay the PCAP file. 



Figure 30: Tcpreplay command. 

This command tells Tcpreplay to replay the PCAP file found in the given directory at full speed (-t) on the interface 
eth1 (-i eth1), and to replay it once (-l 1). 
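
Since the command itself survives only as a figure, the sketch below assembles an equivalent invocation from the flags described above. It is a hypothetical helper: the function name `tcpreplay_command` and the placeholder path are assumptions, and the actual capture location used in the study is not reproduced here.

```python
import shlex


def tcpreplay_command(pcap_path: str, interface: str = "eth1", loops: int = 1) -> list:
    """Build the argument list: -t (top speed), -i (interface), -l (loop count)."""
    return ["tcpreplay", "-t", "-i", interface, "-l", str(loops), pcap_path]


# Placeholder path; substitute the location of the recorded capture.
cmd = tcpreplay_command("/path/to/eternalblue.pcap")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` on a machine where Tcpreplay is installed and the interface exists.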


After the replay has finished, the output on the Sguil environment is shown on Fig.31. 
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Figure 31: Alerts shown on the Sguil Environment with the rule using RegEx. 

To check that we have correctly captured the malformed packet, we use the ID number shown in the Packet Data field of Sguil as a filter in Wireshark. The result is shown on Figure 32.

Figure 32: Checking the accuracy of the Snort rule.

As we can confirm, the packet was detected successfully and it contains all the elements that we used on our Snort 
rule. 

Now, we are going to test the effectiveness of the rule without the use of Regular Expressions. Once again, we edit the local.rules file and update the rules to submit the changes to the system. This time the rule will be like the one displayed on Figure 33.



Figure 33: Snort rule without Regular Expressions. 

Having applied the changes, we replay the PCAP file to check the outcome in Sguil, exactly as we did before with the Regular Expression. The outcome is shown on Figure 34. We notice that the rule now triggers significantly more alerts than it did with Regular Expressions. We then use the ID from the Packet Data field on Sguil as a filter on the Wireshark capture file, in order to check what kind of packet is associated with the alert.

Figure 34: Alerts triggered after modifying the Snort rule.


We can see on Figure 35 that the rule did not match the same packets as before. In fact, it detects a TCP packet that does not show as clearly as before why the alert was triggered. This makes clear that the methodology that includes Regular Expressions optimizes IDPS rules and provides a more precise view of the incident and the origins of the attack.

Figure 35: Outcome in Wireshark without Regular Expressions on Snort rule

VII. Conclusion 

IDPS is not a silver bullet: it needs appropriate tuning and configuration, otherwise there is the danger of producing a lot of false negative and false positive alerts. Apart from tuning and configuration, IDPS should be equipped with advanced techniques in order to enhance its performance and effectiveness without consuming an unbearable amount of the network’s resources. In the current implementation, we wanted to demonstrate the benefits that DPI along with RegExp offer when used effectively on IDPS. DPI is the highest of the Packet Inspection Levels and it is used when we need a detailed view of the network, as this technique provides information concerning all OSI layers. RegExp applied to IDPS gives an additional tool for making the detection rules more effective and precise, helping us understand quickly and with little effort why an alert is triggered. This is a critical factor when the traffic is heavy and many alerts need to be examined.

Our implementation showed that DPI along with RegExp can be beneficial when used with IDPS. We came up with a case scenario where we chose to work with Snort and Wireshark to record and examine the network packets in detail, so that we could identify the malicious code. We selected the EternalBlue and DoublePulsar vulnerabilities for that purpose, both of which take advantage of the SMB protocol. The goal was to launch an EternalBlue - DoublePulsar attack from Kali Linux against the Windows machine, and watch how the IDPS would respond with two different detection rules, with and without the use of RegExp.

The outcome was that by using DPI with RegExp, we could construct rules that are more precise than rules using exclusively exact-match content. Another positive effect was that it reduced the number of alerts to only those concerning the incident. We were able to identify the elements related to the EternalBlue vulnerability and construct a functional rule that was tested for its effectiveness. In a future extension, we will present the results for the DoublePulsar vulnerability.
APPENDIX A

Characters

Each entry lists the character, its description, and an example (sample match).

\d
Description: Matches every digit character.
Example: If we have a file with names and phone numbers, using this character will give us back only the phone numbers without the corresponding names.

\D
Description: Matches every character that is not a digit.
Example: Continuing from our previous example, this notation gives us the list of the names within the file without their phone numbers.

\w
Description: Matches every word character, which includes ASCII letters, digits and underscores. This pattern is not convenient if we need to search for non-ASCII characters, such as é or ñ.
Example: If we apply the \w pattern to the phrase “Good morning, Amélie”, we get back every single character except the comma (which is not a word character) and the é (which is not an ASCII character). The spaces between the words are also a mismatch for this expression.

\W
Description: Matches every character that does not belong to the category of word characters.
Example: Continuing from the previous example, the \W expression matches the comma, the é and the spaces within the phrase.

\s
Description: Matches every whitespace character, in other words spaces, tabs or line breaks.

\S
Description: Matches every single character that is not a whitespace character.

[^ ]
Description: Square brackets that begin with the caret symbol match any character that is not included within them. As a whole, this construct is called a negated set.
Example: The Regular Expression “[^abct]” matches every character of a word such as “hello”, which contains none of the letters listed after the caret. It finds no match in a word such as “bat”, whose letters are all inside the negated set.

.
Description: The full stop symbol matches any single character (by default, every character except a line break).
Example: The Regular Expression “.a” matches any character that is followed by the letter “a”.
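The character classes above can be tried out directly. The following is a minimal sketch using Python's re module; the sample line with names and numbers is hypothetical, and the re.ASCII flag is an assumption made so that \w is restricted to ASCII letters, digits and underscores as the table describes (Python's default \w also matches accented letters).

```python
import re

# Hypothetical sample line, not taken from the paper's data.
line = "Alice 5550100, Bob 5550199"
digits = re.findall(r"\d+", line)        # runs of digit characters
non_digits = re.findall(r"\D+", line)    # runs of everything else

# "\u00e9" is the letter e with an acute accent (a non-ASCII character).
phrase = "Good morning, Am\u00e9lie"
word_chars = "".join(re.findall(r"\w", phrase, flags=re.ASCII))
non_word_chars = re.findall(r"\W", phrase, flags=re.ASCII)

# Negated set: every letter of "hello" is outside [abct], no letter of "bat" is.
outside_hello = re.findall(r"[^abct]", "hello")
outside_bat = re.findall(r"[^abct]", "bat")

print(digits)
print(non_digits)
print(word_chars)
print(non_word_chars)
print(outside_hello, outside_bat)
```

With re.ASCII the accented letter lands in the \W result together with the comma and the spaces, which mirrors the example in the table.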


Anchors: In the Regular Expression syntax, anchors are special characters that indicate a position within a string. In other words, they define where the match must occur.

Each entry lists the character, its description, and an example (sample match).

^
Description: Matches the beginning of a string or a line. We have to point out that ^ and [^ ] are not the same expression, since a caret placed at the start of square brackets is equal to a “not” expression.
Example: The sample match ^a searches for strings that begin with the letter “a”. The string “apples are red” is a match for this pattern, but the string “bananas are yellow” is not.

$
Description: The dollar symbol, in contrast with the previous one, matches the end of a string or a line.
Example: The sample match ow$ matches every string or line that ends in “ow”. Taking the strings used in the previous example, “bananas are yellow” is a match this time.

\b
Description: This expression is known as a word boundary. It matches the position between a word character and a non-word character (or the start or end of the string).
Example: Suppose that we want to search a list of words for the lines that contain words ending in “ow”. We use the sample match ow\b. The output contains words like “yellow”, “low” or “shadow”, in which “ow” is immediately followed by a word boundary.

\B
Description: Matches every position that is not a word boundary.
Example: We use the sample match h\B. If we apply this pattern to the phrase “He tries to match his hat with the watch”, the words that match are “his”, “hat” and “the”. “watch” is not a match, since its “h” sits at a word boundary, and neither is “He”, because it starts with a capital letter.
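The anchor examples above can be reproduced with a short script. This is a minimal sketch using Python's re module; the variable names and the sentence used for the ow\b search are hypothetical, and the \w* prefix is added only so that whole words are returned instead of the bare “ow”.

```python
import re

sentences = ["apples are red", "bananas are yellow"]

# ^ anchors the match at the start, $ at the end of the string.
starts_with_a = [s for s in sentences if re.search(r"^a", s)]
ends_with_ow = [s for s in sentences if re.search(r"ow$", s)]

# \b: "ow" must be followed by a word boundary.
words_ending_ow = re.findall(r"\w*ow\b", "a yellow shadow lies low over the owl")

# \B: the lower-case "h" must NOT be followed by a word boundary.
phrase = "He tries to match his hat with the watch"
inner_h_words = [w for w in phrase.split() if re.search(r"h\B", w)]

print(starts_with_a)
print(ends_with_ow)
print(words_ending_ow)
print(inner_h_words)
```

Words such as “match”, “with” and “watch” are rejected by h\B for the same reason: their “h” is the last character of the word.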


Quantifiers: expressions that help us search for a pattern that appears a specific number of times.

Each entry lists the character, its description, and an example (sample match).

*
Description: This symbol must not be mistaken for a wildcard. In the Regular Expression syntax, “*” matches the pattern that precedes it zero or more times.

+
Description: Matches the pattern that precedes it one or more times. There is a small difference between “+” and “*”: the first matches the pattern located before it one or more times, whereas the second matches it if it appears zero or more times.

?
Description: Matches the single character located before it zero or one time. This character is optional and does not need to be present for a match.
Example: The sample match “travel?ling” matches both the words “traveling” and “travelling”.

? (after another quantifier)
Description: Used after a quantifier, this symbol has another effect. Usually, quantifiers try to match a pattern as many times as possible. In this case the symbol, known as the lazy quantifier, makes them match a pattern as few times as possible.
Example: The Regular Expression “a\w+?” matches the letter “a” followed by one more word character. It finds a match in words such as “alter” or “calculus”, but it ignores the word “a”, because it is a sole letter and nothing follows, and the word “America”, since the Regular Expression is case sensitive and the lower-case “a” is the last letter, followed by nothing.

|
Description: This symbol is called alternation, meaning that it can be used within a group for matching various cases.
Example: The Regular Expression “c(a|u)t” matches both the words “cat” and “cut”, because we are essentially telling the Regular Expression to match the combination whether it contains the letter “a” or the letter “u”.

{}
Description: The curly brackets contain the number of times that the previous element needs to be matched.
Example: The Regular Expression “a\w{1,3}” matches the letter “a” followed by 1 to 3 word characters. In other words, it finds a match in words such as “alter” and “camel”, but not in the article “a”, where nothing follows the letter. Note that the article “an” does match, since “n” is one word character.
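The quantifiers above can be compared side by side in a few lines. The following is a minimal sketch using Python's re module; the word lists are hypothetical and chosen only to show where each quantifier accepts or rejects an input.

```python
import re

# * allows zero repetitions, + requires at least one.
star_matches_a = re.fullmatch(r"ab*", "a") is not None
plus_matches_a = re.fullmatch(r"ab+", "a") is not None

# ? makes the preceding "l" optional.
optional_l = [w for w in ["traveling", "travelling", "travellling"]
              if re.fullmatch(r"travel?ling", w)]

# Greedy versus lazy: the same quantifier with and without the trailing "?".
greedy = re.search(r"a\w+", "calculus").group()
lazy = re.search(r"a\w+?", "calculus").group()

# Alternation inside a group.
alternation = [w for w in ["cat", "cut", "cot"] if re.fullmatch(r"c(a|u)t", w)]

# Bounded repetition: "a" followed by 1 to 3 word characters.
bounded = {w: bool(re.search(r"a\w{1,3}", w)) for w in ["alter", "camel", "an", "a"]}

print(star_matches_a, plus_matches_a)
print(optional_l)
print(greedy, lazy)
print(alternation)
print(bounded)
```

The greedy form consumes the rest of the word, while the lazy form stops after the first character that satisfies the pattern.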
Abstract: Cloud computing offers users worldwide low-cost, on-demand services according to their requirements. In recent years, the rapid growth and service quality of cloud computing have made it an attractive technology for different tech companies. However, with the growing number of data center resources, large amounts of energy are consumed and more carbon is emitted into the air. For instance, the estimated electric power consumption of a Google data center is equivalent to the energy requirement of a small city. Also, even if the virtualization of resources in cloud computing datacenters may reduce the number of physical machines and the cost of hardware equipment, it is still constrained by the energy consumption issue. Energy efficiency has become a major concern for today’s cloud datacenter researchers, who seek to improve cloud service quality and reduce operating cost at the same time. This paper reviews and discusses the literature on improving energy efficiency in cloud computing datacenters. The main objective is the best possible management of the physical machines that host the virtual ones in cloud datacenters.

Keywords: cloud; green cloud; energy; data center; energy consumption; virtualization; optimization; resource allocation; direct migration; consolidation; virtual machine; physical machine.

I. Context and issues 

Cloud computing is among the important technologies of the present time. It is modeled to provide services to users [1] such as computing, software, data access and storage without any prior knowledge of the physical location and configuration of the server providing these services. Large virtualized data centers contain multiple elements, such as servers, networking equipment and cooling systems, that consume a large amount of energy to provide efficient and reliable services to their clients [2].

A report by the International Data Corporation (IDC) predicts that the total floor space of datacenters worldwide will continue to rise and reach about 1.94 billion square feet in 2018, up from about 1.58 billion square feet in 2013 [3].


For example, data centers in the United States consumed about 70 billion kilowatt-hours of electricity in 2014, which represents 2 percent of the country's total energy consumption. Moreover, this energy use is expected to increase by 4% between 2014 and 2020 [4].

The high consumption of energy increases operating costs for service providers and users. Also, a large amount of carbon dioxide is emitted, which harms the environment. For this reason, major research efforts on the energy consumption of data centers have been launched in the last few years. The problem is not only caused by the large amount of physical resources; it also lies in the fact that they are not used in an optimized manner.

Data collected from experiments conducted on more than 5000 servers has shown that the server capacity used is between 10 and 50%, which may lead to additional expenses [5]. Besides, the consumption of full capacity on a running server may reach up to 70% of its power [6].

Virtualization is a very effective technology that enables cloud providers to reduce energy inefficiency, since it allows them to manage multiple instances of virtual machines (VMs) on one server. In fact, by using migration and server consolidation methods, VMs can be dynamically transferred, in real time, to a minimum number of physical servers based on their resource requirements, such as CPU and memory. Also, if a server has no VMs, it is turned off.

These VM displacements may offer good load balancing possibilities. They can be done without service disruption, so that they comply with service level agreements (SLA) and the overall Quality of Service (QoS). However, consolidating badly managed VMs can lead to performance degradation when the demand for resources increases.

Since the QoS defined by the SLA (such as latency, downtime, affinity, placement, response time, data duplication, etc.) is largely needed, cloud providers must find a middle ground between the energy performance of data centers and the SLA. Therefore, effective solutions to manage data center resources are required to minimize the energy consumption and maintain good performance in cloud datacenter environments. In fact, reducing the energy means reducing electricity costs and CO2 emissions, and contributing to the green computing aspect.

Green computing is related to several concepts such as: 
Power management, efficient algorithms, resources allocation, 
virtualization, etc. 

Researchers in energy efficiency have shown that proper scheduling and management of servers in cloud datacenters can efficiently reduce total resource utilization [7]. Experiments show that various server components impact power consumption in the datacenter, including CPU, memory, disk drives, etc. [8], as shown in Fig. 1.



Fig. 1. Range of Power Consumption of Various Server Components 

These statistics motivate research efforts in the field of energy consumption based on multiple substantial energy parameters.

II. Related works 



For several years, great efforts have been devoted to the study of minimizing the energy consumption of data center servers. Current studies indicate that if this consumption continues as it is today, the cost of the energy consumed by a server during its lifetime will quickly exceed the cost of the server itself.

Er. Yashi Goyal, et al. [9] presented an approach to select a VM for migration and a host selection policy to reassign this VM. They rely on the fact that it is better to migrate VMs to hosts that have low CPU usage so as not to overload and block the host. As a result, energy consumption is reduced compared to other approaches.

Figure 2 illustrates live migration: the process of moving a VM or running applications between different physical machines (PMs) without disconnecting the client or application. The memory and the network connection of the VM are transferred from the original PM to the destination host.

However, hot migration, or live migration, can have a negative impact on the performance of applications running in the VM [10].

Fig. 2. The concept of live migration

Sandeep Saxena, et al. [11] proposed a task planning architecture for cloud energy efficiency based on assigning cloud demand to the most appropriate cloud server.

Note that Green Cloud Scheduling [12] refers to the process of planning cloud service requests in the best way possible in order to complete the task in the time allocated with minimal use of energy resources. Energy consumption increases if the task has not been properly scheduled. This scheduling system [2] first classifies incoming cloud service requests into different categories based on certain predefined parameters (Bandwidth, Processor, Security and Confidence Level) and assigns each of the service requests to the cluster that is best suited to the corresponding category.

Edouard Outin, et al. [13] studied the maximization of energy efficiency by consolidating VMs, allocating accurate resources or adjusting VM usage. They adopted a system that uses a genetic algorithm to optimize energy consumption in the cloud and a machine learning technique to improve the fitness function related to the distributed cluster of servers. As a result, this energy model can be integrated into the simulator or other models, and it evaluates cloud configurations with more precision. Yet, it does not take into account the overhead energy costs of live VM migration.


Chien chich, et al. [14] defined two algorithms, the Dynamic Resources Allocation (DRA) method and the Energy Saving method, with a power distribution unit (PDU) connected to the system to monitor its status and record energy consumption. With the DRA algorithm, the waste of idle resources on the VMs can be decreased. Also, the Energy Saving method reduces the energy consumption of the cloud cluster. More precisely, 39.89% of total energy consumption is reduced (also for memory and VCPUs).

N. Madani, et al. [15] [16] developed an architecture for managing VMs in a data center to optimize energy consumption by using consolidation (running as many VMs as possible on a single physical server while avoiding a lack of resources). This solution considers only the CPU as an energy parameter.

Xin Lu, et al. [17] explained that the modified best fit decreasing (MBFD) algorithm sorts the VMs in decreasing order of current usage and allocates each VM to the host that provides the least increase in power consumption due to this allocation. However, this sorting algorithm is supposed to run continuously, which itself consumes a large amount of energy because of the complexity of the algorithm when handling a large number of VMs and nodes. The proposed model builds on the bin packing problem (storing items in a minimum number of bins without exceeding the capacity of any bin at any time) [18]. It identifies hotspot hosts in the cloud platform by running the selection algorithm. Then, the resource loads of the VMs in hotspot hosts are ranked in descending order, while for non-hotspot hosts the resource loads are sorted in ascending order. Comparing the MBFD-based model for scheduling the dynamic migration of VMs with NPA (Non Power Aware), DVFS (Dynamic Voltage and Frequency Scaling) and ST (Simple Threshold) shows a greater reduction in power consumption and in the number of migrations: energy consumption is 68% and 38% lower than with NPA and DVFS, respectively, and about 13% lower than with ST [8].

N. Sharma, et al. [19] divided past research on energy savings in the cloud data center into two main categories: at the single server level and over several servers. At the single server level, Ch. Wu, et al. [20] used an approach based on DVFS which takes the task on the basis of its priorities and minimum order of resources. It generally reduces the energy consumption of the cloud data center.

It was noticed that the overall utilization of the data center increases, resulting in lower energy consumption. However, lower priority tasks have a slow response time, which creates a possibility of SLA violation.

Beloglazov, et al. [21] proposed a Modified Best Fit Decreasing (MBFD) algorithm to sort the VMs in descending order and the PMs in ascending order on the basis of their processing capacity. After sorting the VMs and PMs, the distribution of the VMs on the PMs is done by using First Fit Decreasing (FFD). The limitation of this work is that its only objective is the distribution of VMs. Also, MBFD is not scalable when a large number of requested VMs arrive at the data center.
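The sort-then-place idea described above can be sketched in a few lines. The following Python example is an illustration only and not the implementation from [21]: it assumes a single CPU dimension, arbitrary capacity units, and hypothetical host and VM names, and it applies a plain first-fit pass after sorting VMs in descending order of demand and PMs in ascending order of capacity.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    capacity: int                              # CPU capacity, arbitrary units (assumed)
    used: int = 0
    vms: list = field(default_factory=list)

def first_fit_decreasing(vm_demands, hosts):
    """Place each VM on the first host that still has enough spare CPU."""
    ordered_hosts = sorted(hosts, key=lambda h: h.capacity)       # PMs ascending
    ordered_vms = sorted(vm_demands.items(),
                         key=lambda item: item[1], reverse=True)  # VMs descending
    placement = {}
    for vm, demand in ordered_vms:
        for host in ordered_hosts:
            if host.used + demand <= host.capacity:
                host.used += demand
                host.vms.append(vm)
                placement[vm] = host.name
                break
        else:
            placement[vm] = None               # no host can take this VM
    return placement

# Hypothetical data center with three PMs and four VMs.
hosts = [Host("pm-large", 16), Host("pm-small", 8), Host("pm-medium", 12)]
demands = {"vm1": 6, "vm2": 4, "vm3": 8, "vm4": 2}

placement = first_fit_decreasing(demands, hosts)
idle_hosts = [h.name for h in hosts if not h.vms]
print(placement)
print(idle_hosts)
```

In this run the four VMs fit on the two smaller hosts, so the largest host stays empty and could be switched off, which is the consolidation effect the surveyed approaches aim for.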

An. Xiong, et al. [22] showed that other works use different algorithms for the allocation of VMs, including bio-inspired and nature-inspired algorithms (GA, PSO, OSC, etc.), for assigning VMs in the cloud data center. Yet, there are various problems with the energy-efficient allocation of VMs using PSO. Also, it considers only one type of VM.

Shangg wang, et al. [23] also proposed a model for the energy-efficient placement of VMs in the data center using PSO. Its limitation is that there is no random, energy-aware reassignment of the VMs after changes of speed (it takes many iterations and gives a non-optimal solution).

Hadi Goudarzi et al. [24] presented a constraint programming approach that considers the VM placement problem in order to minimize the total energy consumption over a decision period while maintaining all VMs in the cloud. The approach generates multiple copies of VMs. The algorithm is designed using dynamic programming (DP) and local search to determine the number of copies of each VM and then place those copies on the servers so as to minimize the total energy cost of the cloud system. The DP-based step determines the number of copies of each VM and assigns these VMs to the servers. In the local search step, servers are disabled depending on their utilization, and VMs are placed on the remaining servers (if possible) to minimize power consumption as much as possible. This approach reduces the cost of energy by up to 20% compared to previous VM placement techniques.

A. Choudhary, et al. [25] worked on the efficient use of the energy resources of the data center, which can be achieved in two steps. The first is the efficient placement of VMs, and the second is optimizing the resources allocated in the first step by using live migration as the resource demands change.

VM placement aims to maximize the use of available resources and save energy. As reported by A. Shankar, et al. [26], energy-aware VM placement algorithms include Constraint Programming, Bin Packing, Stochastic Integer Programming and Genetic Algorithm.

In the context of the live migration of VMs for placement optimization, Anwesha Das [27] described that all algorithms that attempt to efficiently allocate resources to demand through live migration answer four questions:

1 Determining when a host is considered overloaded;

2 Determining when a host is considered underloaded;

3 Selecting the VMs to be migrated from an overloaded host;

4 Finding a new placement for the VMs selected for migration from overloaded and underloaded hosts.
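The four questions above can be sketched as one consolidation pass. The following Python example is an illustration under stated assumptions and does not reproduce any of the surveyed algorithms: the utilization thresholds (0.8 and 0.2), the policy of evicting the smallest VMs first, the choice of the fullest feasible target host, and all host and VM names are hypothetical.

```python
import copy

OVERLOAD, UNDERLOAD = 0.8, 0.2      # assumed CPU utilization thresholds

def util(host):
    return sum(host["vms"].values()) / host["capacity"]

def plan_migrations(hosts):
    """Return (vm, source, target) moves for one consolidation pass."""
    hosts = copy.deepcopy(hosts)    # leave the caller's data untouched
    overloaded = [n for n, h in hosts.items() if util(h) > OVERLOAD]         # question 1
    underloaded = [n for n, h in hosts.items() if 0 < util(h) < UNDERLOAD]   # question 2

    selected = []                                                            # question 3
    for name in overloaded:
        host = hosts[name]
        # Assumed policy: evict the smallest VMs until the host is no longer overloaded.
        for vm in sorted(host["vms"], key=host["vms"].get):
            if util(host) <= OVERLOAD:
                break
            selected.append((vm, host["vms"].pop(vm), name))
    for name in underloaded:        # underloaded hosts are emptied completely
        selected += [(vm, demand, name) for vm, demand in hosts[name]["vms"].items()]
        hosts[name]["vms"] = {}

    moves = []                                                               # question 4
    for vm, demand, source in selected:
        candidates = [
            n for n, h in hosts.items()
            if n != source and n not in underloaded
            and (sum(h["vms"].values()) + demand) / h["capacity"] <= OVERLOAD
        ]
        if not candidates:
            hosts[source]["vms"][vm] = demand   # nowhere to go: the VM stays put
            continue
        target = max(candidates, key=lambda n: util(hosts[n]))  # fullest host that fits
        hosts[target]["vms"][vm] = demand
        moves.append((vm, source, target))
    return moves

# Hypothetical cluster: pm1 is overloaded (0.9), pm2 is underloaded (0.1).
cluster = {
    "pm1": {"capacity": 10, "vms": {"a": 5, "b": 3, "c": 1}},
    "pm2": {"capacity": 10, "vms": {"d": 1}},
    "pm3": {"capacity": 10, "vms": {"e": 4}},
}
moves = plan_migrations(cluster)
print(moves)
```

After these moves pm2 hosts no VM and could be turned off, while pm1 drops back to the overload threshold.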

III. SYNTHESIS AND DISCUSSION

Table I summarizes the most significant aspects of the notable research of the last seven years related to the minimization of energy consumption in cloud computing datacenters, ordered by year, mentioning the strategies and measures used (or resources considered, i.e., CPU, memory and so forth), as well as a brief description.


TABLE I. SOME MAIN EXISTING RESEARCHES RELATED TO 
MINIMIZATION OF ENERGY CONSUMPTION 


References / Year | Strategies | Measures | Description
Mahadevan et al. [28] / 2010 | Consolidating server load with network standby mode | CPU; link load | Server load is consolidated onto fewer servers, and unused servers and network items are disabled.
Heller et al. [29] / 2010 | Standby mode | Link load | Selects a set of active network elements, then disables unused links and switches.
Gmach et al. [30] / 2010 | Dynamic power management | Server load | Historical traces for load forecasting, migration of workloads from overloaded servers, shutdown of slightly loaded servers.
Hsu et al. [31] / 2012 | Virtualization | CPU | Limits CPU usage below the threshold and consolidates the workload between virtual clusters.
Botero et al. [32] / 2012 | Virtual network integration | Bandwidth | Selects a set of active network elements, then disables unused links and switches.
Xu et al. [33] / 2012 | Green routing | Switch; link load | Uses fewer network devices to provide routing; inactive network devices are shut down.
Shirayanagi et al. [34] / 2012 | VM consolidation with network standby mode and bypass links | CPU load | Combines the placement of VMs with network traffic consolidation. Bypass links are added to meet the redundancy requirements.
Fang et al. [35] / 2012 | VM consolidation, green routing, standby mode | Traffic rates | Optimizes the placement of VMs and the routing of traffic flows through sleep scheduling of network elements.
Wang et al. [23] / 2013 | DVFS | Task execution time | Non-critical job execution time is extended to reduce CPU voltages.
Beloglazov, et al. [21] / 2013 | MBFD; FFD (First Fit Decreasing) | CPU | Sorts VMs in descending order and PMs in ascending order on the basis of their processing capacity. Afterwards, the distribution of VMs on PMs is performed using FFD.
Damien Borgetto et al. [36] / 2014 | Constraint based: SOPVP | CPU | Migrates VMs when the host is under load. The aim of VM migration is consolidating servers.
Ajith Singh. N latha [37] / 2014 | BASIP: banker algorithm with SIP | CPU; memory; bandwidth | Migrates VMs when the host is overloaded. The objective of VM migration is the attenuation of hot spots.
Zehra Bagheri, et al. [38] / 2014 | Bin packing: least free processing element | CPU | Migrates VMs when the host is overloaded. The goal of VM migration is consolidating servers.
N. Madani, et al. [15] [16] / 2014 | VM consolidation | CPU | Architecture for managing VMs in a data center to optimize energy consumption by grafting a component of multiple consolidation plans that leads to an optimal configuration.
Ch. Wu, et al. [20] / 2014 | DVFS | CPU | This approach takes the task on the basis of priorities and the minimum order of resources.
Er. Yashi Goyal, et al. [9] / 2015 | Migrates VMs to hosts that have low processor utilization | CPU | Approach to select a VM for migration and a host selection policy to reassign this VM.
Sandeep Saxena, et al. [11] / 2015 | Classifies service requests into different categories based on certain parameters and assigns each request to the cluster best suited to the corresponding category | Bandwidth; processor; level of security and trust | Task scheduling architecture for cloud-based energy efficiency based on assigning cloud demand to the most appropriate cloud server.
Edouard Outin, et al. [13] / 2015 | Uses the genetic algorithm to optimize energy consumption in the cloud, together with learning techniques | CPU | Maximizes energy efficiency by consolidating VMs, allocating specific resources or adjusting the use of VMs.
Chien chich, et al. [14] / 2015 | Dynamic Resource Allocation (DRA) method; energy saving method | Memory; VCPU | The DRA can reduce the waste of resources in the VM and can increase the allocation of resources to VMs with insufficient resources. The energy saving method reduces the energy consumption of the cloud cluster.
Xin Lu, et al. [17] / 2015 | MBFD (modified best fit decreasing) | CPU; memory | Sorts the VMs in decreasing order of current usage and allocates each VM to the host that provides the least increase in power consumption due to this allocation.
F. D. Rossi et al. [39] / 2016 | The Energy-Efficient Cloud Orchestrator (e-eco) | Transactions per second; SLA | Decides which energy-saving technique to apply during execution with the cloud load balancer, to enhance the trade-off between energy consumption and application performance.
Kejing He, et al. [40] / 2016 | Improved Underloaded Decision (IUD) algorithm and Minimum Average Utilization Difference (MAUD) algorithm | CPU; SLA | Consolidates VMs. In the underloaded host decision step, it uses the IUD method, which is based on the overload threshold of hosts and the average utilization of all active hosts. In the step of selecting the target host that can accept the VM migration, it adopts the MAUD algorithm, which is based on the average utilization of the data center.
M. A. Khoshkholghi et al. [41] / 2016 | Energy-efficient and SLA-aware VM consolidation mechanism | CPU; RAM; bandwidth; SLA | Uses four steps: overloaded host detection, determining when VMs are reallocated to other hosts; underloaded host detection, determining when VMs are consolidated to other hosts, after which the empty host is switched to sleep mode; selection of the VMs to be migrated from overloaded hosts; choice of the host for the selected VMs.
Riaz Ali, et al. [42] / 2017 | Energy-efficient VMR (Virtual Machine Replacement) algorithm | Number of running physical servers | Switches idle PMs to energy-saving modes, so that the number of running PMs is reduced.
Rahul Yadav, et al. [43] / 2017 | Power Aware Best Fit Decreasing (PABFD) algorithm for VM placement | CPU; SLA; RAM | Selects VMs to consolidate from overloaded or underloaded hosts and migrates them to another appropriate host. Also, the idle hosts are switched to energy-saving mode.


What we can conclude from these studies is that most of them do not take into account all the major energy parameters necessary to ensure ideal energy efficiency.

In fact, the energy parameters include the CPU, the amount of memory, the disk storage space, the quantity of messages transmitted in the network (bandwidth), and the number of input/output operations per second (IOPS) available on the physical medium.

Also, the placement of VMs depends on defined SLA
constraints, which may be:

-Affinity constraints between pairs of VMs, which mean that
we need to find an optimal placement respecting the fact that
two VMs, for example, must be placed on the same physical
server. This is the case for interdependent virtual machines
that share data with each other within short predefined deadlines.

-Security constraints, which may require, for example,
separating two VMs onto different servers (or even two data centers).

-Migration constraints, which may require placing VMs
exclusively on a set of well-defined PMs, or even keeping a
VM on the same server (or in the same data center).
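
The three kinds of constraints listed above can be sketched as a feasibility check on a candidate placement. This is only an illustrative sketch: the function name check_placement, the dictionary encoding of a placement, and the constraint formats are assumptions made here, not part of any surveyed algorithm.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Placement = Dict[str, str]  # hypothetical encoding: VM name -> PM name


def check_placement(
    placement: Placement,
    affinity: Iterable[Tuple[str, str]] = (),
    separation: Iterable[Tuple[str, str]] = (),
    allowed_pms: Optional[Dict[str, Set[str]]] = None,
) -> List[str]:
    """Return the violated constraints (empty list if the placement is feasible)."""
    violations = []
    # Affinity: both VMs of a pair must share the same physical server.
    for a, b in affinity:
        if placement[a] != placement[b]:
            violations.append(f"affinity({a},{b})")
    # Security: both VMs of a pair must be on different servers.
    for a, b in separation:
        if placement[a] == placement[b]:
            violations.append(f"separation({a},{b})")
    # Migration: a VM may only be placed on its set of permitted PMs.
    for vm, pms in (allowed_pms or {}).items():
        if placement[vm] not in pms:
            violations.append(f"migration({vm})")
    return violations


placement = {"vm1": "pm1", "vm2": "pm1", "vm3": "pm2"}
violations = check_placement(
    placement,
    affinity=[("vm1", "vm2")],
    separation=[("vm1", "vm3")],
    allowed_pms={"vm3": {"pm2"}},
)
```

A placement optimizer could call such a check to discard infeasible candidates before comparing their energy cost.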

We defined other energy parameters, including the number of
VMs on a physical machine, the total number of physical
machines used, the number of reallocations (displacements) of
a VM, the duration of interruption of a VM during the
migration phase, the maximum and minimum percentage of
consumption of VMs, and the response time of a task hosted
by a virtual machine (SLA).

We can also consider the durability of data: each data item is
replicated to multiple hosts or data centers in real time (such
as a primary and a backup host).


Thus, research in the literature to date is still lacking, and
little attention has been given to a complete solution that
encloses all the major energy parameters and covers all
possible scenarios and aspects that influence energy
consumption.

IV. Conclusion and Future Work

The field of resource management and energy
consumption is an important topic in cloud computing today.
Data centers consume an enormous amount of electrical
energy, which reduces performance and causes the emission of
a large amount of carbon dioxide. In order to improve the use
of resources and reduce energy consumption, several
technologies are used, such as server virtualization, migration,
and consolidation.

In this paper, we presented an analytical study of the
research in the literature on the green cloud that aims to
reduce the energy consumption of data centers while achieving
application performance. As future work, we will propose a
dynamic optimized solution for resource management through
appropriate allocation of VMs in the cloud data center, which
considers most of the major energy parameters and most of the
possible constraints on allocating VMs to PMs.

References 


[1] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic,
“Cloud computing and emerging IT platforms: Vision, hype, and
reality for delivering computing as the 5th utility,” Future Gener.
Comput. Syst., vol. 25, no. 6, pp. 599-616, Jun. 2009.

[2] R. Sinha, N. Purohit, and H. Diwanji, Power Aware Live Migration
for Data Centers in Cloud using Dynamic Threshold.

[3] “IDC: Amount of World’s Data Centers to Start Declining in 2017,”
Data Center Knowledge, 11-Nov-2014. [Online]. Available:
http://www.datacenterknowledge.com/archives/2014/11/11/idc-amount-of-worlds-data-centers-to-start-declining-in-2017.
[Accessed: 26-Jan-2018].

[4] A. Shehabi, S. J. Smith, N. Horner, I. Azevedo, R. Brown,
J. Koomey, E. Masanet, D. Sartor, M. Herrlin, and W. Lintner,
“United States Data Center Energy Usage Report,” Lawrence Berkeley
National Laboratory, Berkeley, California, LBNL-1005775, 2016.

[5] L. A. Barroso and U. Hölzle, “The Case for Energy-Proportional
Computing,” Computer, vol. 40, no. 12, pp. 33-37, Dec. 2007.

[6] A. Beloglazov and R. Buyya, “Adaptive Threshold-based Approach
for Energy-efficient Consolidation of Virtual Machines in Cloud Data
Centers,” in Proceedings of the 8th International Workshop on
Middleware for Grids, Clouds and e-Science, New York, NY, USA,
2010, pp. 4:1-4:6.

[7] N. Liu, Z. Dong, and R. Rojas-Cessa, “Task Scheduling and Server 
Provisioning for Energy-Efficient Cloud-Computing Data Centers,” in 
2013 IEEE 33rd International Conference on Distributed Computing 
Systems Workshops, 2013, pp. 226-231. 

[8] Y. Sharma, B. Javadi, W. Si, and D. Sun, “Reliability and energy 
efficiency in cloud computing systems: Survey and taxonomy,” J. 
Netw. Comput. Appl. , vol. 74, pp. 66-85, Oct. 2016. 

[9] Y. Goyal, M. S. Arya, and S. Nagpal, “Energy efficient hybrid policy 
in green cloud computing,” in 2015 International Conference on 
Green Computing and Internet of Things (ICGCIoT), 2015, pp. 1065- 
1069. 

[10] W. Voorsluys, J. Broberg, S. Venugopal, and R. Buyya, “Cost of 
Virtual Machine Live Migration in Clouds: A Performance
Evaluation,” in Proceedings of the 1st International Conference on
Cloud Computing, Berlin, Heidelberg, 2009, pp. 254-265. 

[11] S. Saxena, G. Sanyal, S. Sharma, and S. K. Yadav, “A New Workflow 
Model for Energy Efficient Cloud Tasks Scheduling Architecture,” in 
2015 Second International Conference on Advances in Computing 
and Communication Engineering (ICACCE), 2015, pp. 21-27. 

[12] F. Cao and M. M. Zhu, “Energy Efficient Workflow Job Scheduling 
for Green Cloud,” in 2013 IEEE International Symposium on Parallel 
Distributed Processing, Workshops and Phd Forum, 2013, pp. 2218- 
2221. 

[13] E. Outin, J. E. Dartois, O. Barais, and J. L. Pazat, “Enhancing Cloud 
Energy Models for Optimizing Datacenters Efficiency,” in 2015 
International Conference on Cloud and Autonomic Computing 
(ICCAC), 2015, pp. 93-100. 

[14] C. C. Chen, P. L. Sun, C. T. Yang, J. C. Liu, S. T. Chen, and Z. Y. 
Wan, “Implementation of a Cloud Energy Saving System with Virtual 
Machine Dynamic Resource Allocation Method Based on 
OpenStack,” in 2015 Seventh International Symposium on Parallel 
Architectures, Algorithms and Programming (PAAP), 2015, pp. 190- 
196. 

[15] N. Madani, A. Lebbat, S. Tallal, and H. Medromi, “New cloud
consolidation architecture for electrical energy consumption
management,” in AFRICON 2013, 2013, pp. 1-3.

[16] N. Madani, A. Lebbat, S. Tallal, and H. Medromi, “Power-aware 
Virtual Machines consolidation architecture based on CPU load 
scheduling,” in 2014 IEEE/ACS 11th International Conference on 
Computer Systems and Applications (AICCSA), 2014, pp. 361-365. 

[17] X. Lu and Z. Zhang, “A Virtual Machine Dynamic Migration 
Scheduling Model Based on MBFD Algorithm,” Int. J. Comput. 
Theory Eng., vol. 7, no. 4, p. 278, 2015. 

[18] R. Ren, X. Tang, Y. Li, and W. Cai, “Competitiveness of Dynamic
Bin Packing for Online Cloud Server Allocation,” IEEE/ACM Trans.
Netw., vol. 25, no. 3, pp. 1324-1331, Jun. 2017.

[19] N. Sharma and R. M. Guddeti, “Multi-Objective Energy Efficient 
Virtual Machines Allocation at the Cloud Data Center,” IEEE Trans. 
Serv. Comput., vol. PP, no. 99, pp. 1-1, 2016. 

[20] C.-M. Wu, R.-S. Chang, and H.-Y. Chan, “A green energy-efficient
scheduling algorithm using the DVFS technique for cloud
datacenters,” Future Gener. Comput. Syst., vol. 37, pp. 141-147,
Jul. 2014.

[21] A. Beloglazov and R. Buyya, “Managing Overloaded Hosts for 
Dynamic Consolidation of Virtual Machines in Cloud Data Centers 
under Quality of Service Constraints,” IEEE Trans. Parallel Distrib. 
Syst., vol. 24, no. 7, pp. 1366-1379, Jul. 2013. 

[22] A. Xiong and C. Xu, “Energy Efficient Multiresource Allocation
of Virtual Machine Based on PSO in Cloud Data Center,” Math. Probl.
Eng., vol. 2014, Jun. 2014.

[23] S. Wang, Z. Liu, Z. Zheng, Q. Sun, and F. Yang, “Particle Swarm 
Optimization for Energy-Aware Virtual Machine Placement 
Optimization in Virtualized Data Centers,” in Proceedings of the 2013 
International Conference on Parallel and Distributed Systems, 
Washington, DC, USA, 2013, pp. 102-109. 

[24] H. Goudarzi and M. Pedram, “Energy-Efficient Virtual Machine 
Replication and Placement in a Cloud Computing System,” in 2012 
IEEE 5th International Conference on Cloud Computing (CLOUD), 
2012, pp. 750-757. 

[25] A. Choudhary, S. Rana, and K. J. Matahai, “A Critical Analysis of 
Energy Efficient Virtual Machine Placement Techniques and its 
Optimization in a Cloud Computing Environment,” Procedia Comput. 
Sci., vol. 78, pp. 132-138, Jan. 2016. 

[26] Anjana Shankar, “Dissertation on Virtual Machine Placement in 
Computing Clouds; 2010.” 

[27] Anwesha Das, “Project dissertation on A Comparative Study of 
Server Consolidation Algorithms on a Software Framework in a 
Virtualized Environment; 2012.” 

[28] P. Mahadevan, P. Sharma, S. Banerjee, and P. Ranganathan, “Energy 
Aware Network Operations,” in IEEE INFOCOM Workshops 2009, 
2009, pp. 1-6. 

[29] B. Heller et al., “ElasticTree: Saving Energy in Data Center
Networks,” in Proceedings of the 7th USENIX Conference on
Networked Systems Design and Implementation, Berkeley, CA, USA,
2010, pp. 17-17.

[30] D. Gmach et al., “Profiling Sustainability of Data Centers,” in 
Proceedings of the 2010 IEEE International Symposium on 
Sustainable Systems and Technology, 2010, pp. 1-6. 

[31] C.-H. Hsu, K. D. Slagter, S.-C. Chen, and Y.-C. Chung, “Optimizing
Energy Consumption with Task Consolidation in Clouds,” Inf. Sci.,
vol. 258, no. Supplement C, pp. 452-462, Feb. 2014.

[32] J. F. Botero, X. Hesselbach, M. Duelli, D. Schlosser, A. Fischer, and
H. de Meer, “Energy Efficient Virtual Network Embedding,” IEEE
Commun. Lett., vol. 16, no. 5, pp. 756-759, May 2012.

[33] M. Xu, Y. Shang, D. Li, and X. Wang, “Greening data center 
networks with throughput-guaranteed power-aware routing,” Comput. 
Netw., vol. 57, no. 15, pp. 2880-2899, Oct. 2013. 

[34] H. Shirayanagi, H. Yamada, and K. Kono, “Honeyguide: A VM 
migration-aware network topology for saving energy consumption in 
data center networks,” in 2012 IEEE Symposium on Computers and 
Communications (ISCC), 2012, pp. 000460-000467. 

[35] W. Fang, X. Liang, S. Li, L. Chiaraviglio, and N. Xiong, 
“VMPlanner: Optimizing virtual machine placement and traffic flow 
routing to reduce network power costs in cloud data centers,” Comput. 
Netw., vol. 57, no. 1, pp. 179-196, Jan. 2013. 

[36] D. Borgetto and P. Stolf, “An energy efficient approach to virtual 
machines management in cloud computing,” in 2014 IEEE 3rd 
International Conference on Cloud Networking (CloudNet), 2014, pp. 
229-235. 

[37] Ajith Singh. N and M. Hemalatha, “Basip a Virtual Machine
Placement Technique to Reduce Energy Consumption in Cloud Data
Centre,” J. Theor. Appl. Inf. Technol., vol. 59, no. 2, Jan. 2014.

[38] Z. Bagheri and K. Zamanifar, “Enhancing energy efficiency in
resource allocation for real-time cloud services,” in 2014 7th
International Symposium on Telecommunications (IST), 2014, pp.
701-706.

[39] F. D. Rossi, M. G. Xavier, C. A. F. De Rose, R. N. Calheiros, and R. 
Buyya, “E-eco: Performance-aware energy-efficient cloud data center 
orchestration,” J. Netw. Comput. Appl., vol. 78, pp. 83-96, Jan. 2017. 

[40] K. He, Z. Li, D. Deng, and Y. Chen, “Energy-efficient framework for 
virtual machine consolidation in cloud data centers,” China Commun., 
vol. 14, no. 10, pp. 192-201, Oct. 2017. 

[41] M. A. Khoshkholghi, M. N. Derahman, A. Abdullah, S. 
Subramaniam, and M. Othman, “Energy-Efficient Algorithms for 
Dynamic Virtual Machine Consolidation in Cloud Data Centers,” 
IEEE Access, vol. 5, pp. 10709-10722, 2017. 

[42] R. Ali, Y. Shen, X. Huang, J. Zhang, and A. Ali, “VMR: Virtual 
Machine Replacement Algorithm for QoS and Energy-Awareness in 
Cloud Data Centers,” in 2017 IEEE International Conference on 
Computational Science and Engineering (CSE) and IEEE 
International Conference on Embedded and Ubiquitous Computing 
(EUC), 2017, vol. 2, pp. 230-233. 

[43] R. Yadav, W. Zhang, H. Chen, and T. Guo, “MuMs: Energy-Aware 
VM Selection Scheme for Cloud Data Center,” in 2017 28th 
International Workshop on Database and Expert Systems Applications 
(DEXA), 2017, pp. 132-136. 


https://sites.google.com/site/ijcsis/

ISSN 1947-5500 



International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 2, February 2018 


Privacy Things: 

Systematic Approach to Privacy and Personal Identifiable Information 

Sabah Al-Fedaghi 
Computer Engineering Department 
Kuwait University 
Kuwait 

sabah.alfedaghi@ku.edu.kw


Abstract —Defining privacy and related notions such as 
Personal Identifiable Information (PII) is a central issue in
computer science and other fields. The theoretical, technological, 
and application aspects of PII require a framework that provides 
an overview and systematic structure for the discipline’s topics. 
This paper develops a foundation for representing information 
privacy. It introduces a coherent conceptualization of the privacy 
senses built upon diagrammatic representation. A new 
framework is presented based on a flow-based model that 
includes generic operations performed on PII. 

Keywords: Conceptual model, information privacy,
identification, Personal Identifiable Information (PII),
identifiers

I. Introduction 

Privacy has been developed over the years as an applicable 
field of study in engineering systems. According to 
Spiekermann and Cranor [1], “Privacy is a highly relevant 
issue in systems engineering today. Despite increasing 
consciousness about the need to consider privacy in technology 
design, engineers have barely recognized its importance.” 
Privacy engineering is concerned with providing 
methodologies, tools, and techniques for privacy, and it has 
materialized as an emerging discipline as enterprises 
increasingly turn to Internet-based cloud computing. Without 
privacy engineering incorporated into the design, initiation, 
implementation, and maintenance of cloud programs, data 
protection and accessibility standards will become increasingly 
challenging for agencies to properly control [2]. The 2018 EU 
General Data Protection Regulation can require organizations
to pay a fine (up to 4% of their global annual turnover or €20M,
whichever is greater) for the most serious infringements of
privacy regulations. “Privacy laws are suddenly a whole lot 
more costly to ignore” [3]. 

Nevertheless, a 2017 commissioner report [4] complains 
that privacy across the various sectors tends to be quite vague 
and is often expressed in a language that makes it difficult to 
apply. For example, it is protested that Google’s privacy 
policy is too vague for users to control how their information is 
shared. 

The meaning of privacy has been much disputed throughout 
its history in response to wave after wave of new 
technological capabilities and social configurations. The 
current round of disputes over privacy fueled by data science
has been a cause of despair for many commentators and a
death knell for privacy itself for others. [5] (Italics added)

After years of consultation and debate, experts and policymakers
have developed protection principles for privacy that
form a shared set of fair information practices and have 
become the basis of personal data or information privacy laws 
in much institutional and professional work across the public 
and private sectors [6]. 

However, these principles have proved less useful with the 
rise of data analytics and machine learning. Informational 
self-determination can hardly be considered a sufficient 
objective, nor individual control a sufficient mechanism, for 
protecting privacy in the face of this new class of 
technologies and attendant threats. [5] 

Individual control offers no protection or remedy [7] against 
techniques such as inference, modern forms of data analysis [8] 
[9] [10], analysis of social media behavior [11], or cross- 
referencing of “de-identified” data [12]. 

According to Jones [13], recognizing the senses in which 
information can be said to be personal “can form a yardstick by 
which to evaluate supporting tools, organizing schemes and 
overall strategies in a practice” of handling Personal 
Identifiable Information (PII). This paper aims at this objective 
of recognizing the senses of PII. “What is PII? Is it personal?” 
“Personal information” typically refers to information that 
uniquely identifies an individual [14]. Waling and Sell [15] 
include the notion of identifiability in their definition: 
“Personal information is all the recorded information about an 
identifiable individual.” 

The important issue in this context is defining the 
elementary constituents or fundamental units of privacy. 
Spiekermann and Cranor [1] use at least 11 terms to name the 
types of “data” involved in privacy: personal data, personally 
identifiable data, personal information, identifying data, 
identifiable personal data, privacy information, identifying
information, personally identifiable information, identity 
information, and privacy related information. They do not 
explicitly define these types of data. This is a serious issue 
because the data are things around which privacy revolves. 

The theoretical, technological, and application aspects of 
PII require a framework that provides a general view and a 
systematic structure for the discipline’s topics. This paper uses 
a diagrammatic language called Flowthing Machines (FM) to 
develop a framework for a firmer foundation and more
coherent structures in privacy. 

The FM model used in this paper is a diagrammatic 
representation of “things that flow.” Things can refer to a 
range of items including data, information, and signals. Many 
scientific fields use diagrams to depict knowledge and to assist 
in understanding problems. “Today, images are ... considered 
not merely a means to illustrate and popularize knowledge but 
rather a genuine component of the discovery, analysis and 
justification of scientific knowledge” [16]. “It is a quite recent 
movement among philosophers, logicians, cognitive scientists 
and computer scientists to focus on different types of 
representation systems, and much research has been focused on 
diagrammatic representation systems in particular” [17]. 

For the sake of a self-contained paper, we briefly review 
FM, which forms the foundation of the theoretical development 
in this paper. It involves a diagrammatic language that has been 
adopted in several applications [18-22]. The review is followed 
by sections that introduce basic notions that lead to defining
PII. Section 3 explores the notion of a signal as a vehicle that
carries data, which leads to defining data and information
(Section 4) and then to the fundamental notion of an identifier
(Section 5), thus arriving at privacy concepts and PII. Section 6
defines PII and leads to an examination of the question “What
is privacy?” in Section 7. The remaining sections analyze types
of PII, the nature of PII, trivial PII, and sensitive PII.

II. Flowthing Machines (FM) 

The notion of flow was first propounded by Heraclitus, a 
pre-Socratic Greek philosopher who declared that “everything 
flows.” Plato explained this as, “Everything changes and 
nothing remains still,” where instead of “flows” he used the 
word “changes” [23]. Heraclitus (535-475 BCE)
was a native of Ephesus, Ionia (near modern Kuşadası,
Turkey). He compared existing things to the flow of a river,
including the observation that you cannot step twice into the 
same river. Flow can also be viewed along the line of “process 
philosophy,” “championed most explicitly by Alfred N. 
Whitehead in his ‘philosophy of organism,’ worked out during 
the early decades of the 20th century” [23]. 

According to Henrich et al. [24], flows can be 
conceptualized as transformation (e.g., inputs transform into 
outputs), 

Anybody having encountered the construction process will 
know that there is a plethora of flows feeding the process. 
Some flows are easily identified, such as materials flow, 
whilst others are less obvious, such as tool availability. 
Some are material while others are non-material, such as 
flows of information, directives, approvals and the weather. 
But all are mandatory for the identification and modelling of 
a sound process. 

Things that flow in FM refer to the exclusive (i.e., being in
one and only one) conceptual movement among six states
(stages): transfer, process, create, release, arrive, and accept,
as shown in Fig. 1. It may be argued that things (e.g., data) can
also exist in a stored state. Stored is not included as a stage of
FM, however, because it is not a primary state; data can
be stored after being created, hence becoming stored created
data, or after being processed, hence becoming stored processed
data, and so on. Current models of software and hardware do not
differentiate between these states of stored data. The machine
of Fig. 1 is a generalization of the typical input-process-output
model used in many scientific fields.

To exemplify FM, consider flows of a utility such as
electricity in a city. In the power station, electricity is created,
then released and transferred through transmission lines to
city substations, where it arrives. The substations are safety
zones where electricity is accepted if it is of the right type of
voltage; otherwise it is cut off. Electricity is then processed, as
in the case of creating different voltage values to be sent
through different feeders in the power distribution system.
After that, electricity is released from the distribution
substation to be transferred to homes. Receive in Fig. 1 refers
to a combined stage of Arrive and Accept for situations or
scenarios where arriving things are always accepted.
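
The electricity walk-through can be sketched as a small stage machine. The sketch below is illustrative only: the names Stage, ALLOWED, and is_valid_flow are assumptions introduced here, and the transition table is read off the electricity example alone, so the actual Fig. 1 may permit more flows than these.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    CREATE = "create"
    RELEASE = "release"
    TRANSFER = "transfer"
    ARRIVE = "arrive"
    ACCEPT = "accept"
    PROCESS = "process"


# Assumed transitions, inferred from the electricity example only.
ALLOWED = {
    Stage.CREATE: {Stage.RELEASE},
    Stage.RELEASE: {Stage.TRANSFER},
    Stage.TRANSFER: {Stage.ARRIVE},
    Stage.ARRIVE: {Stage.ACCEPT},
    Stage.ACCEPT: {Stage.PROCESS},
    Stage.PROCESS: {Stage.RELEASE},
}


def is_valid_flow(path: List[Stage]) -> bool:
    """Check that every consecutive pair of stages is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


# Power station -> substation -> homes; the thing is in exactly one stage
# at each step, which mirrors the exclusivity of FM stages.
electricity = [
    Stage.CREATE, Stage.RELEASE, Stage.TRANSFER,   # power station
    Stage.ARRIVE, Stage.ACCEPT, Stage.PROCESS,     # substation
    Stage.RELEASE, Stage.TRANSFER,                 # on to homes
]
```

Representing a flow as an ordered list keeps the thing in a single stage at any time; a stored state is deliberately absent, as in the text.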

The FM diagram is analogous to a map of city streets with 
arrows showing the direction of traffic flows. It is a 
conceptual description because it omits specific details of 
characteristics of things and spheres. All types of 
synchronization, logical notions, constraints, timing, ... can be 
included or superimposed on this conceptual representation, in 
the same way traffic controls, signals, and speed constraints 
can be superimposed on a map of city streets. 

Each type of flow is distinguished and separated from other 
flows. No two streams of flow are mixed, analogous to 
separating lines of electricity and water in blueprints of 
buildings. An FM representation need not include all the 
stages; for example, an archiving system might use only the 
stages Arrive, Accept, and Release. Multiple systems captured 
by FM can interact with each other by triggering events 
related to one another in their spheres and stages. 



Fig. 1. Flowthing machine. 


The fundamental elements of FM are described as follows:

Things: A thing is defined as what is being created, released,
transferred, arrived, accepted, and processed while flowing
within and between machines. For example, heat is a thing
because it can be created, processed, and so on. Similarly, time,
space, a contradictory statement, an electron, events, and noise
are all things. Mathematical classes, members, and numbers are
things because they can be created, released, transferred, etc.
“Operations” described by verbs such as generate are not
things but other names for the stage Create. In FM there are
only the “operations” Create, Process, Release, Transfer, and
Receive (assuming that all arriving things are accepted). Thus,
“change” or “sort” is Process, “transport” or “send” is
Transfer, and a product waiting to be shipped is a Released
product.

A machine, as depicted in Fig. 1, comprises the internal flows 
(solid arrows) of things along with the stages and transactions 
among machines. 

Spheres and subspheres are the environments of the thing, 
e.g., the stomach is a food-processing machine in the sphere 
of the digestive system. The machines are embedded in a 
network of spheres in which the processes of flow machines 
take place. A sphere can be a person, an organ, an entity (e.g., 
a company, a customer), a location (a laboratory, a waiting 
room), a communication medium (a channel, a wire). A flow 
machine is a subsphere that embodies the flow; it itself has no 
subspheres. 

Triggering is a transformation (denoted by a dashed arrow)
from one flow to another, e.g., a flow of electricity triggers a
flow of air. In FM, we do not say, “One element is transformed
into another,” but rather, “One element is processed to trigger
the creation of another.” An element is never changed into a
new element; rather, if 1 is a number and 2 is a number, the
operation '+' does not transform 1 and 2 into 3, but '+' triggers
the creation of 3 from the input of 1 and 2.
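
The distinction between transforming and triggering can be illustrated with immutable values. In this minimal sketch, the class Number and the function plus are hypothetical names chosen for illustration; the point is only that processing the inputs creates a new thing and leaves the inputs unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Number:
    """An immutable thing: processing never changes it in place."""
    value: int


def plus(a: Number, b: Number) -> Number:
    # Processing a and b triggers the creation of a new thing;
    # the inputs themselves are left untouched.
    return Number(a.value + b.value)


one, two = Number(1), Number(2)
three = plus(one, two)
```

After the call, one and two still exist with their original values, and three is a separately created thing.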

There are many types of flow things, including data, 
information, money, food, fuel, electrical current, and so forth. 
We will focus on information flow things. 

FM is a modeling language. “A model is a systematic 
representation of an object or event [a thing in FM] in idealized 
and abstract form... The act of abstracting eliminates certain 
details to focus on essential factors” [25]. A model provides a 
vocabulary for discussing certain issues and is thus more like a 
tool for the scientist than for use in, for instance, practical 
systems development [26]. 

We will now introduce basic notions that lead to defining 
PII. To reach this definition, we explore the notion of a signal 
as a vehicle that carries data, a notion that leads to defining 
information, to arrive at the fundamental notion of unique 
identifiers. This provides a way to define privacy and PII. 

III. What Is a Signal? 

The flow of things seems to be a fundamental notion in the 
world. According to NPTEL [27], 

We are all immersed in a sea of signals. All of us from the 
smallest living unit, a cell, to the most complex living 
organism (humans) are all the time receiving signals and 
processing them. Survival of any living organism depends 
on processing the signals appropriately. What is signal ? To 
define this precisely is a difficult task. Anything which 
carries information is a signal... (italics added) 

A signal is typically described as a carrier of message 
content. Thus, fire in the physical sphere creates smoke in the 
physical sphere that flows to the mental sphere to trigger the 
creation of an image or sense of fire. A signal is a carrier 
(itself) that includes content while traveling in a channel and 
may get loaded with noise. Here, creation in the FM model
indicates the appearance in the communication process of a
new thing (a carrier full of noise).

The basic features that differentiate carriers and content 
have fascinated researchers in the communication area. 
According to Reddy [28], “messages” are not contained in the 
signals; “The whole point of the system is that the alternatives 
(in Shannon’s sense) themselves are not mobile, and cannot be 
sent, whereas the energy patterns, the ‘signals’ are mobile.” 
Blackburn [29] insists that “messages are not mobile, while the 
signal is mobile.” In FM, a thing is conceptually mobile since it 
flows. But conceptual flow is different from physical 
movement from one place to another. Flow is not necessarily a 
physical movement; for example, in the sphere of a House, the
house “flows” from one owner to another. The paper will next 
use an FM diagram to illustrate the notion of signal through its 
content. 

Example: Sang and Zhou [30] extend the BPMN platform to 
include specification of security requirements in a healthcare 
process. They demonstrate this through an example and show 
that BPMN standards cannot express the security requirements 
of such a system because of limitations in these standards; e.g., 
the Healthcare Server needs to execute an authentication 
function before it processes a Doctor’s request. The example 
involves five components: (1) a Healthcare Device, a wearable 
device that senses a patient’s vital functions such as blood 
pressure and heart rate, (2) a Healthcare Server, a cloud server 
that processes the patient’s physical data, (3) a Display Device, 
(4) a Doctor, a medical expert who provides medical services, 
and (5) a Medical Device. Fig. 2 shows a partial view of the 
BPMN representation of the process. 



Fig. 2. BPMN representation (redrawn, partial from [30]) 


Fig. 3 shows the corresponding FM representation of the 
example as we understand it. In the figure, the sensor generates 
(1) data that flow to the server (2) to be processed (3) and 
generate feedback (4). 
Fig. 3. FM representation of the example.


The server can create signals (5) to block the transmission
of data from the sensor (6). The feedback is encrypted (7) and
flows to the display device to be decrypted and displayed (8).
The doctor reads the information and tries to log in (10). The
login attempt may fail up to three times (11). After that, the
login is blocked (12). If the login succeeds, the doctor
inputs medical instructions to the system (14).

Fig. 3 is a static description. System behavior is modeled in 
terms of events. Here behavior involves the chronology of 
activities that can be identified by orchestrating their sequence 
in their interacting processes. In FM, an event is a thing that 
can be created, processed, released, transferred, and received. 
A thing becomes active in events. An event sphere includes at 
least the event itself, its time, and its region. For example, an 
event in this example is shown in Fig. 4: Error message is sent 
to the doctor. Accordingly, Fig. 5 shows selected events 
occurring in Fig. 3. 
Fig. 4. The event Error message is sent to the doctor.

Fig. 5. Events in the healthcare process scenario.

To simplify the diagram we will omit the machines of time
and of the event itself. The events are:

E1: The sensor sends data to the server that are processed to
create feedback.

E2: The server blocks data from the sensor.

E3: The feedback is encrypted and sent to the display device.

E4: The doctor reads the displayed information.

E5: The doctor tries to log in and the login fails.

E6: The login fails 3 times and is hence blocked.

E7: The login succeeds.

E8: The doctor sends medical instructions.

Accordingly, control of the system is defined as shown in Fig. 6.



Fig. 6. Control sequence of the system 
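
The login part of this control sequence (events E5 to E8) can be sketched as follows. This is a sketch under stated assumptions: the function login_events and the constant MAX_FAILED_LOGINS are hypothetical names, and the ordering of events is inferred from the event list above rather than from the figure itself.

```python
MAX_FAILED_LOGINS = 3  # from the scenario: login is blocked after three failures


def login_events(attempts):
    """Map a sequence of login outcomes (True = success) to event labels."""
    events = []
    failures = 0
    for ok in attempts:
        if ok:
            # E7: login succeeds; E8: the doctor sends medical instructions.
            events += ["E7", "E8"]
            break
        failures += 1
        events.append("E5")  # E5: a login attempt fails
        if failures == MAX_FAILED_LOGINS:
            events.append("E6")  # E6: third failure, login is blocked
            break
    return events


one_retry = login_events([False, True])
blocked = login_events([False, False, False, True])
```

Once E6 occurs, later attempts are ignored, which is why the trailing successful attempt in the second call never produces E7.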


A raw data machine (the flower in Fig. 7) lacks an agent of 
transfer; hence, it rides these signals. Perceiving a flower 
means receiving its signals of color, smell,... A “signal 
machine” is needed to carry it (e.g., rays of vision). 

Consider another example of the four states of matter 
observable in everyday life: solid, liquid, gas, and plasma (see 
Wikipedia). Fig. 8 shows the occurrence of a signal as an 
event. An event can be described in terms of three 
components: 

(i) Event region 

(ii) Event time 

(iii) Event itself as a thing. 

Note that “processing” of the event itself refers to the 
occurrence of the event, and processing of time refers to time 
running its course. 

This event must occur many times before the observing 
agent can reach the conclusion that there are four observable 
states. Accordingly, Fig. 9 shows this repeated experience of 
events until the recurrent information triggers the realization 
that there are only four observable states. 
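
The three components of an event, and the accumulation of repeated events, can be sketched as below. The class Event, the function observed_states, and the sample log are illustrative assumptions; collecting distinct states shows only what has been observed so far and does not by itself establish that no fifth state exists.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Event:
    region: str  # (i) where the event occurs
    time: int    # (ii) when it occurs
    thing: str   # (iii) the event itself, here the observed state


def observed_states(events: Iterable[Event]) -> Set[str]:
    """Accumulate the distinct states seen across repeated events."""
    return {e.thing for e in events}


# Hypothetical observation log of an agent.
log = [
    Event("kitchen", 1, "liquid"),
    Event("kitchen", 2, "gas"),
    Event("freezer", 3, "solid"),
    Event("sky", 4, "plasma"),
    Event("kitchen", 5, "liquid"),
]
states = observed_states(log)
```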

IV. What Is Information? What Are Data?


Fig. 7. Information is processed raw data. 


Data are typically described as “raw information” or
“things that have been given” [31]. In FM, “raw” refers to
newness: a thing that has emerged or been created from
outside the domain of the FM diagram. These raw data are
different from data manufactured by processes in the FM
diagram. The data have the possibility of sliding to become
the content of a signal; thus the data are (in computer jargon)
the sender and (part of) the “message” simultaneously, as seen
in Fig. 7. The raw data “ride” the signal to flow to another
sphere (e.g., to be processed to trigger information). In
physics, the sound of a bell is cut off in a vacuum because
there are no signals (waves) to carry it when there is no
surrounding air. Note that the purpose of this discussion is to
apply it to persons and their PII.
Data can also be “manufactured,” as is clear in Shannon’s 
communication model. Fig. 10 shows how information about 
these states is generated from data directly and indirectly. 
First, the four observable states (1) “expose” themselves 
through signals (2) as information (3). Then, after a sufficiently 
large number of events in which these phenomena occur, the informed 
agent can construct codes (4) in the form of data of signals (5) 
that flow to another informed agent (6). Note that in this case 
the data 00, 01, 10, and 11 are intentionally moved to fill the 
signal as its content (i.e., they do not slide to become content 
as in the case of the flower). Certain pieces of information form 
identifiers, as described next. 
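
The construction of codes for the four observable states can be illustrated with a minimal sketch. It assumes a fixed-width binary code and an arbitrary assignment of codes to states (which state receives 00, 01, 10, or 11 is not specified in the text); the function names are hypothetical and serve only as illustration.

```python
from itertools import product

# The four observable states named in the text.
STATES = ["solid", "liquid", "gas", "plasma"]


def build_codebook(states):
    """Assign each state a fixed-width binary code (assignment order is an assumption)."""
    width = max(1, (len(states) - 1).bit_length())
    codes = ["".join(bits) for bits in product("01", repeat=width)]
    return dict(zip(states, codes))


def encode(observations, codebook):
    """The informed agent fills the signal with manufactured data (codes)."""
    return "".join(codebook[state] for state in observations)


def decode(signal, codebook):
    """A second informed agent recovers the states from the signal content."""
    width = len(next(iter(codebook.values())))
    reverse = {code: state for state, code in codebook.items()}
    return [reverse[signal[i:i + width]] for i in range(0, len(signal), width)]


codebook = build_codebook(STATES)
message = encode(["gas", "solid", "plasma"], codebook)
```

Here the codes are placed into the signal deliberately by the sending agent, in contrast to the raw data of the flower, which slide into the signal without such an agent.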

V. What Is an Identifier? 

Meet Jean Blue, humanoid living in Centerville. Jean is a 
real person, an identity. Jean has many attributes, including 
gender, height, weight, preferred language, capabilities and 
disabilities, citizenship, voter registration, ... [pieces of 
information]. Among these attributes are some identifiers 
... Identifiers are attributes whose values are specific and 
unique to an individual. [32] 


We can use an identifier to refer to recognizing a person 
uniquely. According to Clarke [34], “Persona [identity] refers 
to the public personality that is presented to the world [and] 
supplemented, and to some extent even replaced, by the 
summation of the data available about an individual.” 

Problems occur in relation to the nature of data that 
materialize identifiers. What is “the data available about an 
individual”? Is the datum John F. Kennedy is a very busy 
airport about an individual named John F. Kennedy? Is the 
datum John loves Alice about John or Alice? We will use the 
term identifier to refer to things that identify (recognize) an 
individual uniquely in a specific sphere (context). 

The Aristotelian entity is a single, specific existence (a 
particularity) in the world. In FM, as shown in Fig. 11 (circles 
1-3), an identifier of an entity can be its natural descriptors 
(e.g., 6 feet tall, brown eyes, male, blood type A, actions, etc.). 
Accordingly, an identifier is a thing that is processed to 
identify a (natural) person uniquely. Note the context in the 
figure related to PII in space, e.g., location and time. Consider 
the example of a privacy policy given by Finin et al. [35]: 


A person’s identifier can be constructed from things that 
identify (recognize) the person uniquely, e.g., characteristics 
and features. Identifiers are important for establishing the 
particularity or uniqueness of a person necessary for unique 
identification (i.e., recognition of a person). According to the 
Microsoft Word dictionary, identity is “the set of 
characteristics that somebody recognizes as belonging 
uniquely to himself or herself and constituting his or her 
individual personality for life.” Grayson [33] expands this 
definition to include those characteristics about somebody that 
others recognize as well. 

According to Grayson [33], “What we hear about identity 
(the noun) embodies more directly the notion of identify (the 
verb)... These notions are at best incompatible and, in the 
fullest understanding of identity, mutually exclusive.” The 
definition of identity includes “belonging uniquely to . . . and 
constituting his or her individual personality ... for life,” thus 
“more than one identity for a given object means that object 
no longer has a unique identity.” Put simply, if identity 
embodies identification and there are several methods of 
identifying a person, then a definition of identity that includes 
uniqueness seems contradictory. 


Do not allow my social network colleagues group (identity 
context) to take pictures of me (identity context) at parties 
(activity context) held on weekends (time context) at the 
beach house (location context). 

Fig. 12 expresses diagrammatically the prohibited situation: 
Social network colleagues group take pictures of me at parties 
held on weekends at the beach house. 
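
The prohibited situation can also be read as a conjunction of context conditions. The following is a minimal sketch under that reading, not the FM diagram itself; the class, field names, and string values are hypothetical, chosen only to mirror the five contexts in the policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """One concrete situation, described by the contexts named in the policy."""
    actor_group: str   # identity context of those acting
    subject: str       # identity context of the person pictured
    action: str        # what is being done
    activity: str      # activity context
    day: str           # time context
    location: str      # location context


WEEKEND = {"Saturday", "Sunday"}


def is_prohibited(s: Situation) -> bool:
    """True only when every context of the policy matches at the same time."""
    return (
        s.actor_group == "social network colleagues"
        and s.subject == "me"
        and s.action == "take pictures"
        and s.activity == "party"
        and s.day in WEEKEND
        and s.location == "beach house"
    )


beach_party = Situation("social network colleagues", "me", "take pictures",
                        "party", "Saturday", "beach house")
office_party = Situation("social network colleagues", "me", "take pictures",
                         "party", "Friday", "office")
```

Under this sketch, changing any single context (for example, the day or the location) yields a situation that the policy does not prohibit.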

Consider the set of unique identifiers of persons. 
Ontologically, as mentioned, the Aristotelian entity/object is a 
single, specific existence (a particularity) in the world that 
comprises natural descriptors as communicated by signals. 
These descriptors exist in the entity/object. Height and eye 
color, for example, exist as aspects of the existence of an 
entity. 

Some descriptors form identifiers. A natural identifier is a 
set of natural descriptors that facilitate recognizing a person 
uniquely. We create an identifier (e.g., name) for a “specific” 
newborn baby (specific physical place and relationships). An 
identifier can also be created from the activities and actions of 
a person (circles 4 and 5 in Fig. 11). 
Fig. 12. The specification: Social network colleagues group take pictures of me at parties held on weekends at the beach house. 


Note that an identifier is not necessarily sufficient to 
identify a person uniquely; we also need a recognition machine 
(dotted box in Fig. 11) that connects the identifier to the 
person. In reality, an “identifier” is insufficient to recognize a 
person, e.g., many people share the same name. This 
recognition implies knowing “who somebody is” or “the ability 
to get hold of them” as physical “bodies” [36]. “Simply to 
know a person's name is obviously not to know who that 
person is, even when the name in question is unique. ... We can 
also know who someone is without knowing their name” [36]. 

The dictionary definition of “identification” includes “act 
of identifying” as well as “evidence of identity.” The “act” of 
identifying refers to pointing at or mapping to an individual. 
Similarly, “evidence” of identity refers to mapping this 
evidence to an individual. Typically, the “identity” itself is tied 
to physical existence. The identity of a “real” individual is “the 
individual's legal identity or physical ‘meat space’ location” 
[37], and “to identify the parties to a contract is to make it 
possible to hale them into court if they violate the contract. 
Identity, in other words, is employed as a means of access to a 
person’s body” [36]. Thus, “identity” is something that 
distinguishes one “meat space” from another. This 
“something” is clearly a type of information. It is also “private” 
because it uniquely identifies this ontological space. Hence an 
identifier is the information aspect of the ontological space 
occupied by a human. Names, Social Security numbers, 
pictures, physical descriptions, fingerprints, and other 
identification devices are pointers to this “ontological space.” 
We can recognize “identity” directly without using any of these 
pointers. When a witness “identifies” an offender from among 
other suspects in a police lineup, the witness recognizes this 
“ontological space” [36]. 

VI. What is Personal Identifiable Information, PII? 

Information privacy “involves the establishment of rules 
governing the collection [in FM, Receive] and handling [in FM, 
Process] of personal [in FM, PII] data such as data in credit, 
medical, and government records. It is also known as ‘data 
protection’” [38]. In the strict context of limiting privacy to 
matters involving information, the concept of privacy has been 
fused with PII protection [39]. In this context, PII denotes 
information about identifiable individuals in accessible form 
[40]. 

PII means any information concerning a natural person that, 
because of name, number, symbol, mark, or other identifier, 
can be used to identify that natural person [41]. It includes 
name or any identifiable number attached to it plus any other 
information triggered by signals such as address (location), 
telephone number, sex, race, religion, ethnic origin, sexual 
orientation, medical records, psychiatric history, blood type, 
genetic history, prescription profile, fingerprints, criminal 
record, credit rating, marital status, educational history, place 
of work, personal interests, favorite movies, lifestyle 
preferences, employment record, fiscal duties, insurance, 
ideological, political, or religious activities, commercial 
solvency, banking or saving accounts, real estate rental and 
ownership records. 

Also, PII is “(t)hose facts, communications, or opinions 
which relate to the individual, and which it would be 
reasonable to expect him to regard as intimate or sensitive and 
therefore to want to withhold or at least to restrict their 
collection, use or circulation” [40] (italics added). The British 
Data Protection Act of 1984 defines PII (“personal data”) as 
“information which relates to a living individual who can be 
identified from that information (or from that and other 
information in the possession of the data user), including any 
expression of opinion about the individual but not any 
indication of the intentions of the data user in respect of that 
individual ... which is recorded in a form in which it can be 
processed by equipment operating automatically in response to 
instructions issued for that purpose” [42] (Italics added). The 
assumption here is that this PII is factual information (i.e., not 
libel, slander, or defamation). Jones [13] categorized six 
“senses” of PII (calling it personal data): information that is 
controlled or owned by or about us, directed toward us, sent by 
us, experienced by us, or relevant to us. The U.S. Department 
of Health & Human Services [43] defines PII in an IT system 
or online collection as information (1) that directly identifies an 
individual, or (2) by which an agency intends to identify 
specific individuals in conjunction with other data elements, 
i.e., indirect identification. The U.S. Department of Homeland 
Security (DHS) defines PII as “Any information that permits 
the identity of an individual to be directly or indirectly 
inferred, including any information which is linked or linkable 
to that individual” [44]. 

These are sample definitions of PII. In the context of FM, 
PII is defined as shown in Fig. 13. A single identifiable person 
is “the physical ‘meat space’ location” [37] and the identifier 
“is employed as a means of access to a person’s body” [36]. 

Personal identifiable information (PII) is vital in today’s 
privacy legislation, according to Schwartz and Solove [45]: 


Fig. 13. Definition of PII (diagram elements: Create; Personal Identifiable Information (PII); Create; Identifier) 


Personally identifiable information (PII) is one of the most 
central concepts in information privacy regulation. The 
scope of privacy laws typically turns on whether PII is 
involved. The basic assumption behind the applicable laws is 
that if PII is not involved, then there can be no privacy harm. 

VII. What is Privacy? 

The word “private” derives from the Latin privatus, 
meaning “withdrawn from public life” or “deprived of office” 
[46]. The dictionary meaning of privacy includes the state of 
being private and undisturbed, freedom from intrusion or 
public attention, avoidance of publicity, limiting access, and 
the exclusion of others [47]. Privacy supports the conditions 
for a wide range of concepts including seclusion, retirement, 
solitude, isolation, reclusion, solitariness, reclusiveness, 
separation, monasticism, secretiveness, confidentiality, 
intimacy, anonymity, and the ability to be left alone, do as we please, and 
control information about ourselves. It is also an umbrella term 
that includes diverse contexts such as private places or 
territorial privacy, private facts or activities, private 
organizations, private issues, private interests, and privacy in 
the information context [48]. In general, privacy is also 
described as “the measure of the extent an individual is 
afforded the social and legal space to develop emotional, 
cognitive, spiritual and moral powers of an autonomous agent” 
[46]. It is “the interest that individuals have in sustaining a 
‘personal space’, free from interference by other people and 
organizations” [49]. 

The notion of privacy as the right to control “personal” 
information has roots in the concept of individual liberty. 
Philosophically, liberty means freedom from some type of 
control. Liberty implies the ability to control one’s own life in 
terms of work, religion, beliefs, and property, among other 
things. Historically, the right to control one’s own property is a 
significant indicator of liberty. An owner can use, misuse, give 
away or dispose of his or her own property. Similarly, privacy 
is a personal thing “owned” by individuals, and they “control” 
it. Informational privacy is “the right to exercise some measure 
of control over information about oneself” [50]. 

In FM, we can view privacy on the basis of identifiers; in 
this case, privacy is cutting off sources of manufactured 
identifiers, as shown in Fig. 14. It is a restriction of flows of 
signals between a person and others. Fig. 14 is a version of Fig. 
11, with the identifier machine deleted. Westin has defined 
privacy as the “claim of individuals, ... to determine for 
themselves how, when, and to what extent information about 
them is communicated to others” [50]. 

It is common in the literature to define privacy as Being in 
control of who can access information about the person. This 
concept is represented in Fig. 15, where the release of data 
about a person is triggered by the person himself or herself. 
Privacy may also be described as Times when the person is 
completely alone, away from anyone else, as shown in Fig. 16. 

The point here is that the FM language is reasonably 
precise for expressing diverse conceptualizations of 
privacy so that they can be related and analyzed in a unified 
framework. 
Fig. 14. Privacy is “cutting off sources” of manufactured identifiers 

Fig. 15. Privacy is Being in control of information about the person. 

Fig. 16. Privacy is Times when the person is completely alone, away from 
anyone else. 


VIII. Types of PII 

In linguistic forms of information, we consider assertion a 
basic component. Language is the main vehicle that describes 
things and their machines in the domain of knowledge. In 
linguistic-based privacy, PII is an element that points uniquely 
to a single person-thing (type person). PII essentially “makes a 
person known,” a potentially sharable entity that can be passed 
along in “further sharing.” The classical treatment of assertion 
(judgment) such as PII divides it into two concepts: subject 
(referent) and predicate, which form a logical relation; in FM, 
however, PII may or may not be a well-structured linguistic 
expression. The linguistic internal structure of an assertion is 
not the element of interest; rather, its referent is. Newton is 
genius, Newton genius, genius Newton, Newton genius is, 
Newton is x, and y Newton x are all assertions as long as Newton is 
an identifier. Even a linguistic expression with one 
word, such as Newton, is PII in which the non-referent part is 
empty. 

PII is any information that has referent(s) of type natural 
person. There are two types of personal information: 

(1) Atomic PII (APII) is PII that has a single human referent. 

(2) Compound PII (CPII) is PII that has more than one human 
referent. Fig. 17 shows a binary CPII. A CPII is reducible to a 
set of APIIs and a relationship, as is made clear in Fig. 17. For 
example, the statement John and Mary are in love is 
privacy-reducible to John and someone are in love and 
Someone and Mary are in love. 

In logic (correspondence theory), reference is the relation 
of a word (logical name) to a thing. Every PII refers to its 
referents in the sense that it “leads to” him/her/them as 
distinguishable things in the world. This reference is based on 
his/her/their unique identifier(s). 

A single referent does not necessarily imply a single 
occurrence of a referent. Thus, “John wounded himself” has 
one referent. Referent is a “formal semantics” notion [51] built 
on any linguistic structure such that its extension refers to an 
individual (human being). 
In logic, reference is the relation of a word (logical name) 
to a thing [52][53]. In PII, this thing is limited to human 
beings. In the language of logic, CPII is a predicate with more than 
one argument. Here the term “predicate” is used loosely since 
internal structure is immaterial. If we restrict FM to the 
language of first-order logic, then its predicates are applied to 
logical names that refer to (natural) persons only. A piece of 
APII is a monadic predicate, whereas CPII is a many-place 
predicate. A three-place logical predicate such as give(John, 
George, pen) is a two-place predicate in FM since it includes 
only two individuals. In FM, it is assumed that every many-place 
predicate represents that many monadic predicates. 
loves(x1, x2) represents loves(x1) and being-loved(x2). 
Accordingly, loves(x1, x2) is private with respect to x1 because 
of loves(x1), and it is private with respect to x2 because of 
being-loved(x2). APII is the “source” of privacy. CPII is 
“private” because it embeds APII. 
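
The distinction between APII and CPII, and the reduction of CPII to APIIs, can be sketched as follows. This is an illustrative reading only: it assumes that an assertion is given as a predicate name with a tuple of arguments and that the set of names referring to natural persons is known in advance. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Assertion:
    predicate: str
    arguments: Tuple[str, ...]


def human_referents(assertion, persons):
    """Distinct person names in the assertion; repeated occurrences count once."""
    seen = []
    for arg in assertion.arguments:
        if arg in persons and arg not in seen:
            seen.append(arg)
    return seen


def classify(assertion, persons):
    """APII has one human referent; CPII has more than one."""
    count = len(human_referents(assertion, persons))
    if count == 0:
        return "not PII"
    return "APII" if count == 1 else "CPII"


def reduce_to_apii(assertion, persons):
    """Privacy-reduce a CPII: one APII per referent, other persons become 'someone'."""
    reduced = []
    for keep in human_referents(assertion, persons):
        args = tuple(
            arg if (arg == keep or arg not in persons) else "someone"
            for arg in assertion.arguments
        )
        reduced.append(Assertion(assertion.predicate, args))
    return reduced


PERSONS = {"John", "George", "Mary"}
give = Assertion("give", ("John", "George", "pen"))
wounded = Assertion("wounded", ("John", "John"))
```

With these definitions, give(John, George, pen) is classified as CPII with two referents, while wounded(John, John) is APII because a single referent occurs twice.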

IX. Proprietorship of PII 

We call the relationship between PII and its referent 
proprietorship , such that the referent is the proprietor. The 
proprietorship of PII is conferred only to its proprietor. CPII is 
proprietary information of its referents: all donors of pieces of 
APII that are embedded in the compound PII. 

Proprietorship is not Ownership. Historically, the rights to 
property were gradually legally extended to intangible 
possession such as processes of the mind, works of literature 
and art, good will, trade secrets, and trademarks [54]. In the 
past and in the present, private property has facilitated a means 
to protect individual privacy and freedom [55]; however, even 
in the nineteenth century it was argued that “the notion of 
privacy is altogether distinct from that of property” [56]. 

A proprietor of PII may or may not be its possessor and 
vice versa. Individuals can be proprietors or possessors of PII; 
however, non-individuals can be only possessors of PII. Every 
piece of APII is a proprietary datum of its referent. 
Proprietorship is a nontransferable right. It is an “inalienable 
right” in the sense that it is inherent in a human being. Others 
may have a “right” to it through possessing or legally owning 
it but they are never its proprietor. Proprietorship of PII is 
different from the concept of copyright. 

Copyright refers to the right of ownership, to exclude any 
other person from reproducing, preparing derivative works, 
distributing, performing, displaying, or using the work covered 
by copyright for a specific period of time [57]. In privacy the 
(moral) problem is more than “the improper acquisition and 
use of someone else’s property, and ... the instrumental 
treatment of a human being, who is reduced to numbers and 
lifeless collections of information” [58]. It is also more than 
“the information being somehow embarrassing, shameful, 
ominous, threatening, unpopular or harmful.” Intrusion on 
privacy occurs even “when the information is ... innocuous” 
[58]. “The source of the wrongness is not the consequences, 
nor any general maxim concerning personal privacy, but a lack 
of care and respect for the individual” [58]. Treating PII is 
equivalent to “treating human beings themselves” [58]. 

It is also important to notice the difference between 
proprietorship and knowing of PII. Knowing here is equivalent 
to possession of PII. APII of x is proprietary information of x, 
but it is not necessarily “known” by x (e.g., personal medical 
tests of employees). Possession-based “knowing” is not 
necessarily a cognitive concept. “Knowing” varies in scope; 
thus, at one time there may be a piece of APII “known” only 
by a limited number of entities that then becomes “known” by 
more entities. 

The concept of proprietorship is applied to CPII, which 
represents “shared proprietorship” but not necessarily shared 
possession or “knowing.” Some or all proprietors of compound 
private information may not “know” the information. 

X. Trivial PII 

According to our definition of PII, every bit of information 
about a singly identified individual is his/her atomic PII. 
Clearly, much PII is trivial. Newton has two hands, Newton is 
Newton, Newton is a human being, etc. are all trivial bits of PII 
of Newton. Triviality here is the privacy counterpart of 
analytics in logic. Analytical assertions in logic are those 
assertions of which we can determine their truth without 
referring to the source. An assertion such as All human beings 
are mortals is true regardless of who says it. According to 
Kant, an analytical assertion is a priori and does not enlarge 
our knowledge. This does not mean that analytical assertions 
are insignificant; the opposite is true, in that all axioms of logic 
(e.g., principles of contradiction) are of this type. Similarly, 
trivial PII is privacy-insignificant. We will assume that PII is 
non-trivial. 

The definition of PII implies embedding of identifiers. 
While identifiability is a strict measure of PII, sensitivity is a 
notion that is hard to pin down. 

XI. PII Sensitivity 

Spiekermann and Cranor [1] introduce “an analysis of 
privacy sensitive processes” in order to understand “what user 
privacy perceptions and expectations exist and how they might 
be compromised by IT processes ... to understand the level of 
privacy protection that is required.” Accordingly, they claim: 

All information systems typically perform one or more of 
the following tasks: data transfer, data storage and data 
processing. Each of these activities can raise privacy 
concerns. However, their impact on privacy varies 
depending on how they are performed, what type of data 
is involved, who uses the data and in which of the three 
spheres they occur. [Italics added] 

FM introduces a more comprehensive view of these tasks. In 
general, the notion of sensitivity is a particularly difficult 
concept. 

Defining PII as “information identifiable to the 
individual” does not mean that PII is “especially sensitive, 
private, or embarrassing. Rather, it describes a relationship 
between the information and a person, namely that the 
information—whether sensitive or trivial—is somehow 
identifiable to an individual” [59]. The significance of PII 
derives from its privacy value to a human being. 

From an informational point of view, an individual is a 
bundle of his or her PII. PII comes into being not as an 
independent piece of information, but rather as a constitutive 
part of a particular human being [58]. PII ethics is concerned 
with the “moral consideration” of PII because PII’s “well-being” 
is a manifestation of the proprietor’s welfare [60]. 

There is a point that must be exceeded before beginning to 
consider PII sensitive. Social networks depend on the fact that 
individuals willingly publish their own PII, causing more 
dissemination of sensitive PII that compromises individuals’ 
information privacy. This may indicate that PII sensitivity is 
an evolving notion that needs continuous evaluation. On the 
other hand, many Privacy-Enhancing Technologies (PETs) are 
being devised to help individuals protect their privacy [61], 
indicating the need for this notion. 

The sensitivity of PII is a crucial factor in determining an 
individual’s perception of privacy [62]. In many situations, 
sensitivity seems to depend on the context, and this cannot 
always be captured in a mere linguistic analysis; however, this 
does not exclude the possibility of “context-free” sensitivity 
(see [22]). 

A typical definition of sensitivity of PII refers to the 
impact of handling (e.g., disclosing) of PII, as shown in Fig. 
18. 


Fig. 18. Sensitive PII (diagram elements: Context; Sensitive PII; Process, e.g., lose, compromise, disclose; Create; Identifier; Proprietor; Impact; Create, e.g., harm, embarrassment, inconvenience, unfairness) 

XII. Misinformation 

Consider the APII John is honest. Suppose that it is a true 
assertion. Does this imply that John is dishonest, which is 
false, is not PII? Clearly, this is not acceptable. Describing 
John as honest or dishonest is a privacy-related matter 
regardless of whether the description is true. That is, 
“non-information” about an individual is also within the domain of 
his/her privacy. 

XIII. Conclusion 

This paper has defined a fundamental notion of privacy: PII 
based on the notion of “things that flow.” The resultant 
conceptual picture includes signals in communication and 
information and clarifies the sequence of ontological spaces 
and their relationship associated with these concepts. Clarifying 
these concepts is a beneficial contribution to the field of 
information privacy. 

Further work can be directed toward developing a more 
elaborate model of types of privacy, especially in the area of 
sensitivity. Additional work includes PII sharing involving 
proprietors, possessors, and sharers (e.g., senders, receivers) 
of PII. 

Abstract — This article describes the development of an Internet 
of Things (IoT) based decision support system for vehicle drivers 
using GPS and GSM modules. The project helps to avoid 
road accidents by maintaining the proper speed limit at 
different locations such as school zones, hospital regions and 
so on. Initially, an admin database is created with a web server. 
The database contains six fields: S.No, longitude1, 
latitude1, longitude2, latitude2, and speed limit. The web server 
is implemented with a PHP page that provides a 
connection to the database, allowing web clients to send 
queries to the database. A PC application is distributed among 
local guides, who can provide the speed limits of their allocated 
regions. A GPS receiver provides the vehicle’s 
location, and a GSM module is configured as GPRS to provide 
an internet connection through mobile data. An Organic Light 
Emitting Diode (OLED) display shows the speed limit at 
the vehicle’s location. An Arduino UNO (ATmega328P) board 
interfaces all the components. Instructions are given to the 
driver on the OLED display when the 
location is tracked by GPRS, and an alarm sounds in 
extreme conditions. 

Keywords: Adaptation, Cloud, GPS, GSM, IoT, OLED. 

I. INTRODUCTION 

At present, accidents and the enforcement of traffic rules are 
major concerns. Various safety measures, such as wearing a helmet 
while riding a motorcycle and fastening seat belts while driving 
a car, are strictly enforced. Among the many factors that affect 
safety, speed is a major one. The measures used until today to 
limit a vehicle’s speed are caution signs and speed breakers. This 
work proposes another solution. Inspired by the speed limits 
imposed on vehicles, we suggest that a speed limit be kept in 
place, where the limit varies in accordance with the vehicle’s 
location. 

Such systems have been implemented in many ways. Some use ARM 
processors, some use wireless technologies, and some use a 
GPS module with adaptation techniques and frequency modules. 
Every implementation has its own advantages and 
disadvantages. 

[1] and [2] deal with the Global Positioning System combined with 
an embedded wireless system. The main function of this 
method is to control vehicles in critical zones. The entire 
implementation is based on an ARM processor at the receiver 
side, that is, in the vehicle. Paper [3] deals 
with the adaptation technique. In it, two horn levels are 
fixed according to the minimum and maximum speed limits. 
The normal horn, at an audible level, is the first level; if the 
speed exceeds the maximum, the high-level horn is switched on, 
so that the driver limits the speed accordingly. The proposal 
of this work is based on a database containing speed 
limits for all geographical coordinates and on an internet 
connection. The speed limits in the database are 
entered based on the road conditions and the location. The 
database returns not only the speed limit at the vehicle’s 
location but also the range of locations in which that speed 
limit is applicable. This reduces the burden on the server of 
repeatedly answering the queries of all the vehicles. 
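
The effect of returning the applicable range together with the speed limit can be illustrated with a minimal sketch. It assumes that each database row (S.No, longitude1, latitude1, longitude2, latitude2, speed limit) describes a rectangular region whose opposite corners are the two coordinate pairs; this interpretation, and all class and function names, are assumptions made for illustration. The server is simulated by an in-memory list rather than by the PHP page and database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpeedZone:
    """One database row; the two coordinate pairs are assumed to be opposite corners."""
    serial_no: int
    longitude1: float
    latitude1: float
    longitude2: float
    latitude2: float
    speed_limit: int

    def contains(self, longitude: float, latitude: float) -> bool:
        lon_lo, lon_hi = sorted((self.longitude1, self.longitude2))
        lat_lo, lat_hi = sorted((self.latitude1, self.latitude2))
        return lon_lo <= longitude <= lon_hi and lat_lo <= latitude <= lat_hi


def lookup(zones: List[SpeedZone], longitude: float, latitude: float) -> Optional[SpeedZone]:
    """Stand-in for the server query: return the zone covering the position, if any."""
    for zone in zones:
        if zone.contains(longitude, latitude):
            return zone
    return None


class VehicleClient:
    """Keeps the last returned range and queries again only after leaving it."""

    def __init__(self, zones: List[SpeedZone]):
        self.zones = zones
        self.cached: Optional[SpeedZone] = None
        self.queries = 0

    def speed_limit(self, longitude: float, latitude: float) -> Optional[int]:
        if self.cached is None or not self.cached.contains(longitude, latitude):
            self.queries += 1
            self.cached = lookup(self.zones, longitude, latitude)
        return self.cached.speed_limit if self.cached else None


# Hypothetical sample row: a zone with a limit of 30.
client = VehicleClient([SpeedZone(1, 78.0, 17.0, 78.1, 17.1, 30)])
```

As long as the vehicle stays inside the cached range, no further query is sent, which is the reduction in server load described above.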

To implement this approach, we require a database server, a 
web server, a PC application, a GPS module, an Arduino UNO 
board, a GSM module, and an OLED display. This helps drivers 
to maintain their speed so that it is easy to adapt 
to their location. Many might argue that having 
intelligent circuitry such as this in the vehicle will ruin the 
driving experience. If the driver maintains a speed that is not 
dangerous, however, the presence of this circuitry does not 
affect the driving experience in any way. If this can be 
implemented in every vehicle on the roads, fatal or 
serious accidents can be drastically reduced, 
resulting in a much safer driving experience. 

Section II of this paper deals with materials and methods, 
explaining the block diagram, the web application, and the 
modules used. Section III describes the experimental 
investigation and the software used. Section IV 
discusses the experimental results. Section V concludes 
the project. 

II. IMPLEMENTATION OF DECISION SUPPORT OF VEHICLE 
DRIVERS BY GPS AND GSM 



Fig. 1. Block diagram of decision support for vehicle 
drivers 
A. EXPLANATION ABOUT THE CIRCUIT DIAGRAM 

The above block diagram has three main parts. In the admin 
part, various servers are developed using a cloud 
platform; these include the database server and the web server. A 
PC application is distributed among a few people known as 
local guides, who conduct surveys of various geographical 
regions and provide speed limits for those regions. This PC 
application allows local guides to insert rows into the 
database. To secure the database entries, these local 
guides must first be authorized by the admin, and the IP 
addresses of their networks must be allowed through 
the firewall of the database by the admin. In the user part, 
the database is accessed through a website using the 
network connection provided by the GSM module. The request is 
made to the website by forming a URL string that is 
concatenated with the coordinates provided by the GPS receiver. 
We use an OLED to display the speed limit to the vehicle driver. 


B. GPS MODULE 

In this implementation we used the GPS SIM28ML module. It is 
a standalone GPS receiver with very good low-power 
characteristics. 

Fig. 2. GPS receiver for tracking the location and latitude values 

We use UART communication to transfer longitudes and 
latitudes from the GPS receiver to a microprocessor for further 
processing. The output of the GPS receiver is in NMEA 
format. An example of such data is 

$GPRMC,235316.000,A,4003.9040,N,10512.5792,W,0.09,144.75,141112,*19 
$GPGGA,235317.000,4003.9039,N,10512.5793,W,1,08,1.6,1577.9,M,-20.7,M,0000*5F 
$GPGSA,A,3,22,18,21,06,03,09,24,15,,,,2.5,1.6,1.9*3E 


C. GSM MODULE 

... calling and data connection. We reconfigure the GSM module 
as a GPRS module to connect with mobile data. 

Fig. 3. Photograph of GSM SIM900A module 

This module requires a SIM to function. The AT command to 
configure the GSM module as a GPRS module is 
AT+SAPBR=3,1,"CONTYPE","GPRS". 


D. ARDUINO UNO 

The UNO board is a platform powered by the ATmega328P processor. 
It has 14 digital input/output pins. The board can be powered 
through the USB port of a computer or by a 9 V battery. It has 
a single hardware serial port but can be configured to provide 
multiple software serial ports. We use this board to 
interface the GPS module, the GSM module, and the OLED. We acquire 
the longitudes and latitudes from the GPS module. We form a 
string consisting of the URL of the webpage hosted on the cloud 
concatenated with the data received from the GPS. Using the 
GSM module, we form an HTTP client and acquire data 
from the website. This data is displayed on the OLED. 



Digital Ground 
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USB Plug 


External Power 


Reset Pm 
3.3 Volt Power Pm 
5 Volt Power 


Voltage In 


Serial Out (TX) 
Serial In (RX) 


Reset Button 

In Circuit 
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flkNfftm 

Microcontroller 


Analog In 
Pins (0-5) 


Fig.4. Arduino UNO for interfacing of GPS and GSM 


E. OLED 


Organic light emitting diode(OLED) is used in this approach 
to display the corresponding speed limit of the vehicle 
location. This data is given to the OLED by Arduino UNO 
through I2C communication. 


In this implementation we used GSM SIM900A module. This 
module can be used for various purposes such as messaging, 
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Fig.5. OLED for displaying the output 

OLED is a light emitting technology, prepared by the inserting 
of series of organic thin films between two conductors. When 
the electrical current is practical, a bright light is emitted. 
OLEDs have emissive display, which does not need a 
backlight so these are very thin and more efficient than LCD 
display. This is the single component of the entire circuit on 
the user’s end that is visible to the user. It acts as front end of 
the entire circuit. 
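
As an illustration of how the latitude and longitude can be pulled out of the NMEA output quoted in the GPS module section, the following is a minimal sketch in Python rather than the Arduino code used in the project. The function names `nmea_to_decimal` and `parse_gprmc` are hypothetical, the field positions follow the common `$GPRMC` layout, and the checksum is deliberately not verified.

```python
from typing import Optional, Tuple


def nmea_to_decimal(value: str, hemisphere: str) -> float:
    """Convert an NMEA ddmm.mmmm (or dddmm.mmmm) field to signed decimal degrees."""
    dot = value.index(".")
    degrees = int(value[:dot - 2])        # everything before the two minute digits
    minutes = float(value[dot - 2:])      # mm.mmmm
    decimal = degrees + minutes / 60.0
    # South and West are negative by convention.
    return -decimal if hemisphere in ("S", "W") else decimal


def parse_gprmc(sentence: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from a $GPRMC sentence, or None if unusable.

    The checksum after '*' is ignored in this sketch.
    """
    body = sentence.split("*")[0]
    fields = [f.strip() for f in body.split(",")]
    if len(fields) < 7 or fields[0] != "$GPRMC" or fields[2] != "A":
        return None                        # wrong sentence type or no valid fix
    lat = nmea_to_decimal(fields[3], fields[4])
    lon = nmea_to_decimal(fields[5], fields[6])
    return lat, lon


if __name__ == "__main__":
    example = ("$GPRMC, 235316.000, A, 4003.9040, N, "
               "10512.5792, W, 0.09, 144.75, 141112, *19")
    print(parse_gprmc(example))
```

The decimal-degree pair returned here is the kind of value that would be appended to the URL string described for the UNO board.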


III. Experimental investigations

This implementation requires an admin who operates the database
and web servers and holds all the permissions to access them. The
database contains six columns: S.No, longitude1, latitude1,
longitude2, latitude2 and speed limit. The web server contains a
PHP page which provides a connection to the database, allowing
web clients to send queries to the database.

A PC application is distributed among a few people known as
local guides, who survey geographical regions and provide speed
limits for those regions. This PC application allows local guides
to insert rows into the database. To secure the database entries,
local guides must first be authorized by the admin, and the IP
addresses of their networks must be allowed through the
firewall of the database by the admin.

The users for whom this setup is intended must have the
following embedded in their vehicles: a GPS receiver to provide
the vehicle's location, a GSM module configured as GPRS to
provide an internet connection through mobile data, an OLED to
display the speed limit of the vehicle's location and a UNO board
to interface all these components with each other. The individual
descriptions of all the modules in the project are given below.

A. MICROSOFT AZURE

Microsoft Azure is a cloud platform that provides an interface to
create all the servers required and host them. We used the
services provided by Azure to create and host a database server
and a web server. The process of creating a database server is as
follows.

Log in to the Azure portal.

Fig. 6. Azure Dashboard to create a database server and
web server for implementation of decision support for
vehicle drivers

Azure provides a dashboard to access all your resources. We
can create a database server by navigating to New ->
Databases -> SQL Database. This is shown in Fig. 7. After
entering proper credentials such as database name, resource
group name, server name and pricing tier, the entire details are
shown as in Fig. 8.

Fig. 7. Screen shot of Creating New SQL Database on
Azure dash board

Fig. 8. Photograph of Created new Database

We can create a new table by navigating to Tools -> Query
editor and logging in using the admin's username and
password. A web server can be created in an analogous way.

Fig. 9. Web APP used to support vehicle drivers

To host a web page using a web app, we use the File Transfer
Protocol (FTP). We first need to download the profile of the web
app, which contains its FTP username and password. We can
establish the connection using File Explorer: copy and paste the
publish URL into the navigation bar, and a pop-up asking for
login appears. After logging in, we can copy all the files we need
onto the server.

Fig. 10. FTP login for usage of web server

B. VISUAL STUDIO

Visual Studio is an integrated development environment which
provides the tools required to build apps for all sorts of
platforms. In this project we used Visual Studio to build a
Windows Forms app that can be distributed among local guides.
We can create a new Windows Forms app by selecting one in the
create new project option.

Fig. 11. Creating Windows Form APP to know the latitude
and longitude values

After creating a Windows Forms app, we can build the app's look
using the designer. To provide authentication for the app, we
design a form asking for username and password, which provides
access to another form that allows local guides to insert the
longitudes and latitudes bounding a region and the corresponding
speed limit. We can create a connection between the database and
the app by using XML connection strings. We require the server
name, database name, admin username and password. After
successfully building the app, it can be distributed among local
guides by any means that suits the deployment process.

IV. EXPERIMENTAL RESULTS AND THEIR DISCUSSIONS

A. ADMIN SIDE

Fig. 12. Database for implementation of decision support
for vehicle drivers

Fig. 13. Screen shot of created New Database

B. PC APPLICATION

Fig. 16. Screenshot of Local Guide App

C. USER SIDE

Fig. 17. Photograph of hardware implementation of
decision support for vehicle drivers using IoT

The system has three parts: admin, local guides and user. The
database and web pages are hosted by the admin on the
Microsoft Azure platform. The local guides have an app
distributed by the admin, which allows them to insert the
longitudes and latitudes bounding an area and the corresponding
speed limit. A local guide can do so only once the admin has
allowed access to the database through the firewall. Users have
a GPS module, a GSM modem and a UNO board, which form the
back end, and an OLED for the front end. The location given by
the GPS module is concatenated with the string containing the
name of the web page hosted by the admin. This string acts as
the URL used to query the website. The result of this request is
the longitudes and latitudes bounding the region in which the
user is present and the corresponding speed limit. This speed
limit is displayed on the OLED. As long as the user remains in
the same region, the website is not queried again. The same
procedure repeats when the user crosses the region boundaries.
This approach was tested at a fixed location and later on a
two-wheeler along an entire road, where the user crosses the
boundaries of one region and enters another. The reaction time
taken to obtain the speed limit of the current region is
considerably low. Even though the timing of the implementation
could be improved, the results obtained are considered
satisfactory.
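
The user-side behaviour described above (query the website once, then stay silent until the vehicle leaves the returned bounding region) can be sketched as follows. This is an illustrative Python model, not the Arduino firmware: the names `Region`, `build_url`, `SpeedLimitClient` and `make_fake_server`, the query parameters `lat` and `lon`, and the `example.invalid` address are all assumptions, and an in-memory list stands in for the database reached through the PHP page.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class Region:
    """One database row: two corner points and the speed limit between them."""
    lon1: float
    lat1: float
    lon2: float
    lat2: float
    speed_limit: int

    def contains(self, lat: float, lon: float) -> bool:
        return (min(self.lat1, self.lat2) <= lat <= max(self.lat1, self.lat2)
                and min(self.lon1, self.lon2) <= lon <= max(self.lon1, self.lon2))


def build_url(base: str, lat: float, lon: float) -> str:
    """Concatenate the page address with the GPS coordinates (hypothetical parameter names)."""
    return f"{base}?lat={lat:.6f}&lon={lon:.6f}"


def make_fake_server(rows: List[Region]) -> Callable[[str], Optional[Region]]:
    """Stand-in for the web page: decode the URL and return the matching row, if any."""
    def fetch(url: str) -> Optional[Region]:
        query = parse_qs(urlparse(url).query)
        lat, lon = float(query["lat"][0]), float(query["lon"][0])
        return next((r for r in rows if r.contains(lat, lon)), None)
    return fetch


class SpeedLimitClient:
    """Queries the server only when the vehicle leaves the cached region."""

    def __init__(self, fetch: Callable[[str], Optional[Region]], base_url: str):
        self.fetch = fetch
        self.base_url = base_url
        self.current: Optional[Region] = None
        self.requests = 0

    def speed_limit(self, lat: float, lon: float) -> Optional[int]:
        if self.current is None or not self.current.contains(lat, lon):
            self.requests += 1
            self.current = self.fetch(build_url(self.base_url, lat, lon))
        return self.current.speed_limit if self.current else None


if __name__ == "__main__":
    rows = [Region(78.0, 17.0, 78.1, 17.1, 40), Region(78.1, 17.0, 78.2, 17.1, 60)]
    client = SpeedLimitClient(make_fake_server(rows), "http://example.invalid/limit.php")
    for fix in [(17.05, 78.05), (17.06, 78.06), (17.05, 78.15)]:
        print(fix, client.speed_limit(*fix), "requests so far:", client.requests)
```

Caching the bounding region on the device is what keeps the number of GPRS requests low, which matters for both latency and mobile data use.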

V. CONCLUSIONS 

India suffers greatly from road accidents, most of which are
caused by vehicles travelling over the speed limit. There is a need
for a system that can automatically restrict the speed of
vehicles according to the speed limit regulation of particular
zones, so that accidents due to over-speeding are minimized. The
proposed approach works well for that purpose. It also gives an
overall monitoring of the vehicles, indicating any traffic jams or
accidents to the officials. This helps the government to get a
better view of the overall situation in the road zones. The control
can further be divided into zones to give a better view. When the
system malfunctions, it does no harm to the driving experience,
since precautionary methods are in place to check for
malfunctions. If this system is made compulsory for all vehicles,
a noticeable decrease in the number of road accidents would be
seen, reducing the heavy loss of life
and property in the country.

Abstract: In the last few years, both UAVs and cloud
computing have gained the interest of researchers. UAVs,
which can be classified as a flying ad hoc network (FANET),
play an important role in both military and civilian
applications. Cloud computing has likewise gained an
important role in many applications such as data processing
and data preservation, and it allows users to access
applications, data and files from any device wherever they
are. All of these benefits normally depend on the
availability of the internet; our research shows that, in
military applications, the user can also gain access through
a military network instead of the internet, which makes the
network more secure. This paper proposes algorithms for
UAVs equipped with a cloud computing system so that they
work as a single system in military applications.

I. Introduction 

Wireless ad hoc networks consist of a collection of
wireless nodes that communicate over a common wireless
medium. Mobile ad hoc networks are gaining momentum
because they help realize network services for mobile users
in areas with no preexisting communications infrastructure
[1]. Ad hoc networking enables independent wireless nodes,
each limited in transmission and processing power, to provide
as a whole wider networking coverage and processing
capabilities. The nodes can also be connected to a fixed
backbone network through a dedicated gateway device,
enabling IP networking services in areas where Internet
services are not available due to a lack of existing
infrastructure. Ad hoc networks are widely used in many
fields, including complex military systems such as Unmanned
Aerial Vehicles (UAVs), and the performance of these systems
has to be more accurate. This led us to add cloud computing
to the infrastructure of the UAVs in order to obtain high
accuracy. This field of research leads us to cover the power
of cloud computing by providing a perspective study on cloud
computing that sheds light on the ambiguous understanding
of it. Cloud computing is not just a service offered from a
remote data center. It is a set of approaches that can help
organizations quickly and effectively add and subtract
resources in almost real time. Cloud computing provides the
means through which resources such as computing power,
computing infrastructure and applications can be delivered
to users as a service wherever and whenever they need them
over the Internet [2].


Cloud computing can be used to overcome the limitations of
data centers. An enterprise data center is where servers and
storage are located, operated and managed. A functional data
center requires a lot of power, space, cooling, maintenance
and so on. Most human activities, such as energy, lighting,
telecommunications, Internet, transport, urban traffic,
banks, security systems, public health and entertainment,
are controlled by data centers. People rely on the
functioning and availability of one or multiple data
centers. In a traditional data center, resources cannot be
added and released in an automated or self-service manner,
but in the cloud users can request extra resources on demand
and release them when they are no longer needed. The fact
that the cloud can easily expand and contract is one of the
main characteristics that attract users and businesses to
the cloud.

The main characteristics that cloud computing offers today
are cost, virtualization, reliability, security and
maintenance. The cloud becomes even more dependable when it
is combined with robotics as a cloud robotic system, which
takes the benefits of both cloud and robotics.

A UAV, which is part of an ad hoc network since it is
classified as a flying ad hoc network (FANET), can be
thought of as a kind of robotic system, because a networked
robotic system refers to a group of robotic devices that are
connected via a wired and/or wireless communication
network [3].

Nowadays, military applications focus on the importance of
UAVs in reconnaissance and attack roles in order to save
lives, time and money. Supplying UAV systems with cloud
computing will streamline operations and reduce manning,
which leads us to investigate the challenges of equipping a
UAV system with cloud computing.


II. Designing an algorithm for the proposed scenario

The implemented algorithm is divided into five main
parts, against which any system similar to our proposed
system could be examined to show its quality and
efficiency. The first part covers the communication
between UAVs and between the UAVs and the ground control
station (GCS). This leads to the second part, an arithmetic
algorithm for packet transmission between the UAVs and
the GCS. The third part covers the mobility of the UAVs,
giving their velocity and displacement; the movement of
the UAVs receives considerable attention in our discussion
and leads to the fourth part, an algorithm for the
connectivity of the system, where by system we mean the
Unmanned Aerial System (UAS). The last part is the cloud
computing algorithm for the UAV system, since our scenario
is based on a UAV cloud computing system.


a. Designing an algorithm for communication between 
UAVs 

The communication algorithm designed for the proposed
scenario covers the communication between UAVs and
between the UAVs and the ground control station. It is
obtained by studying the wireless link between the
transmitters and receivers of the UAVs and the GCS. The
algorithm depends on some major parameters: the aspect angle
($\varphi$), which is the main quantity defining the radiation of
the antenna of $UAV_i$ with respect to $UAV_j$, as shown in
Fig. 1; the horizontal angle ($\varphi_H$), which is the angle
between the roll axis of the transmitter $UAV_i$ and the line of
sight (LOS); and the vertical angle ($\varphi_V$), which is the
angle between the LOS and the projection of the LOS onto the
yaw plane of $UAV_j$ [4].



Fig. 1. Position of UAVs during transmitting and receiving [4].

The algorithm also depends on the power received by the
receiving antenna ($P_r$), the input power of the transmitting
antenna ($P_t$), the gain of the transmitting antenna ($G_t$), the
gain of the receiving antenna ($G_r$), the wavelength ($\lambda$)
and the distance between the two UAVs ($R$).

From these parameters, the first step of the proposed algorithm
is to calculate the average received power ($P_r$) as follows:

$$P_r = P_t\, G_t\, G_r \left(\frac{\lambda}{4\pi R}\right)^2 \tag{1}$$

The second step in the algorithm expresses equation (1) in
terms of the aspect angles as follows:

$$P_r = P_t\, G_t(\varphi_H, \varphi_V) \left(\frac{\lambda}{4\pi R}\right)^2 \tag{2}$$

Taking the logarithm of both sides of equation (2), we get
equation (3) as follows:

$$10\log_{10}(P_r) = 10\log_{10}(P_t) + 10\log_{10}\big(G_t(\varphi_H, \varphi_V)\big) - 20\log_{10}\left(\frac{4\pi R}{\lambda}\right) \tag{3}$$
The third step in the algorithm is to calculate the free space path
loss ($Q_0$) as follows:

$$P_r(\mathrm{dBW}) = P_t(\mathrm{dBW}) + G_t(\mathrm{dBi}) - Q_0(\mathrm{dB}) \tag{4}$$

$$Q_0(\mathrm{dB}) = 20\log_{10}\left(\frac{4\pi R}{\lambda}\right) \tag{5}$$

$$Q_0(\mathrm{dB}) = 32.4 + 20\log_{10}(f_{\mathrm{MHz}}) + 20\log_{10}(d_{\mathrm{km}}) \tag{6}$$

The frequency in equation (6) is the transmission frequency,
and the term $d_{\mathrm{km}}$ is the distance between the two
UAVs or between the UAV and the ground control station.
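
To make the third step concrete, the free space path loss and the received power can be evaluated numerically. The sketch below is in Python rather than the Matlab used for Fig. 2, and the function names and the sample frequency of 2400 MHz are assumptions chosen only for illustration.

```python
import math


def fspl_db(freq_mhz: float, dist_km: float) -> float:
    """Free space path loss in dB: 32.4 + 20 log10(f in MHz) + 20 log10(d in km)."""
    return 32.4 + 20.0 * math.log10(freq_mhz) + 20.0 * math.log10(dist_km)


def received_power_dbw(pt_dbw: float, gt_dbi: float,
                       freq_mhz: float, dist_km: float) -> float:
    """Received power in dBW: transmit power plus antenna gain minus path loss."""
    return pt_dbw + gt_dbi - fspl_db(freq_mhz, dist_km)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper's scenario.
    for d in (0.5, 1.0, 2.0, 4.0):
        print(f"d = {d:4.1f} km  loss = {fspl_db(2400.0, d):7.2f} dB")
```

Each doubling of the distance adds about 6 dB of loss, so the loss grows logarithmically with distance rather than linearly.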

The fourth step in the algorithm is to calculate the total path loss
($Q_t$) from the free space path loss ($Q_0$), the log-normal
shadowing effect $X$ (in dB) and the random variable ($D$)
representing the received envelope of the fast fading signal:

$$Q_t = Q_0 + X + 20\log_{10}(D) \tag{7}$$

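The fifth step applies a condition on the received power, namely
that the path loss does not exceed a threshold,
$Q_0 + X \le Q_{threshold}$, as follows:

$$\Pr(Q_0 + X \le Q_{threshold}) = \int_{-\infty}^{Q_{threshold} - Q_0} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx = 1 - \frac{1}{2}\,\mathrm{erfc}(a) \tag{8}$$

where $\sigma$ is the standard deviation of the shadowing in dB and
$a$ can be represented as $a = (Q_{threshold} - Q_0)/(\sqrt{2}\,\sigma)$.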
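
The threshold condition of the fifth step can be evaluated with the complementary error function. The sketch below assumes zero-mean Gaussian shadowing with standard deviation sigma in dB, and takes the argument of erfc to be the distance between the threshold and the free space path loss divided by sqrt(2) times sigma; the function name and sample values are hypothetical.

```python
import math


def prob_loss_within_threshold(q0_db: float, q_threshold_db: float,
                               sigma_db: float) -> float:
    """Probability that Q0 + X stays at or below the threshold.

    X is assumed to be zero-mean Gaussian shadowing with standard
    deviation sigma_db (in dB), so the result is 1 - 0.5 * erfc(a)
    with a = (Q_threshold - Q0) / (sqrt(2) * sigma).
    """
    a = (q_threshold_db - q0_db) / (math.sqrt(2.0) * sigma_db)
    return 1.0 - 0.5 * math.erfc(a)


if __name__ == "__main__":
    # Illustrative values: 100 dB free space loss, 8 dB shadowing deviation.
    for threshold in (92.0, 100.0, 108.0, 116.0):
        p = prob_loss_within_threshold(100.0, threshold, 8.0)
        print(f"threshold = {threshold:5.1f} dB  probability = {p:.4f}")
```

When the threshold equals the free space path loss the probability is one half, and it approaches one as the margin grows relative to sigma.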

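The sixth step in the algorithm is to determine the probability
density function of the Rayleigh distribution, $P(D)$, using
the random variable $D$ representing the received envelope of the
fast fading signal and the time-average power of the received
signal, $\bar{P}$:

$$P(D) = \frac{D}{\bar{P}} \exp\left(-\frac{D^2}{2\bar{P}}\right), \quad D \ge 0 \tag{9}$$

When a line-of-sight component is present, the envelope follows
the Rician distribution:

$$P(D) = \frac{D}{\bar{P}} \exp\left(-\frac{D^2 + A^2}{2\bar{P}}\right) I_0\left(\frac{A D}{\bar{P}}\right), \quad D \ge 0 \tag{10}$$

where $A$ represents the peak amplitude of the LOS signal component
and $I_0(\cdot)$ represents the zero-order modified Bessel function of
the first kind:

$$I_0(y) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \exp(-y \sin\tau)\, d\tau \tag{11}$$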

The last step in the communication algorithm is to define the
ratio $A^2/(2\bar{P})$ as the Rician $K$ factor. This factor
measures the link quality: it is the ratio between the power of the
LOS component and the power of the non-line-of-sight (NLOS)
components, and an increase in $K$ leads to a clearer link with less
fading. This leads to an important condition in the algorithm:
before calculating the average power in Rician fading, the $K$
factor must be larger than or equal to the ratio of the power of
the LOS and the power of the NLOS components.

$$P_r = \int_0^{\infty} D^2 P(D)\, dD = A^2 + 2\bar{P} \tag{12}$$

By substituting $A^2 = \frac{K P_r}{K+1}$ and $2\bar{P} = \frac{P_r}{K+1}$
into equation (10), the final form of the arithmetic algorithm,
equation (13), is obtained.
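
Equation (13) itself is not legible in this copy. As a hedged sketch only, carrying out the substitution described above in the Rician density of equation (10) gives the standard K-factor form of that density, which is presumably what equation (13) states. Here $\bar{P}$ denotes the time-average power, an assumed notation:

```latex
% Substituting A^2 = K P_r / (K+1) and 2\bar{P} = P_r / (K+1)
% into P(D) = (D/\bar{P}) \exp(-(D^2 + A^2)/(2\bar{P})) I_0(AD/\bar{P}):
\begin{align*}
\frac{D}{\bar{P}} &= \frac{2D(K+1)}{P_r}, \qquad
\frac{D^2 + A^2}{2\bar{P}} = \frac{(K+1)D^2}{P_r} + K, \qquad
\frac{AD}{\bar{P}} = 2D\sqrt{\frac{K(K+1)}{P_r}}, \\[4pt]
P(D) &= \frac{2D(K+1)}{P_r}
        \exp\!\left(-K - \frac{(K+1)D^2}{P_r}\right)
        I_0\!\left(2D\sqrt{\frac{K(K+1)}{P_r}}\right),
        \qquad D \ge 0 .
\end{align*}
```

Setting $K = 0$ removes the LOS component and recovers the Rayleigh density with $P_r = 2\bar{P}$.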



Fig. 2. Relation between free space path loss and distance. 


Fig.2 shows the relation between free space path loss and the 
distance between two UAVs when applying the algorithm 
using Matlab program. 

b. Arithmetic algorithm for Packet transmission time 

The arithmetic algorithm for packet transmission depends
on the main timing parameters of IEEE 802.11-1999: the short
inter-frame space (SIFS), the distributed coordination function
inter-frame space (DIFS), where the DCF is the fundamental
MAC technique of the IEEE 802.11 based WLAN standard, the
back-off time and the ACK time. These parameters are
accumulated to form the main arithmetic algorithm for packet
transmission as follows:

$$T_{total} = DIFS + \text{Back-off time} + \big(\text{Data (bytes)} + 28\ \text{(bytes)}\big) \times \frac{8}{\text{Data rate (bit/s)}} + SIFS + \text{Overhead time} + \text{ACK time} \tag{14}$$

Fig. 3. DIFS & SIFS [5].

Fig. 4. Time diagram for packet transmission [5].


The main purpose of this equation is to ensure packet reception
and avoid collisions between packets. This is done by using
DIFS and SIFS, whose durations depend on the physical layer.
In our scenario we use direct sequence spread spectrum (DSSS)
parameters as follows: the short inter-frame space (SIFS) is
20 µs, the slot time ($T_{slot}$) is 40 µs, and the distributed
coordination function inter-frame space (DIFS) is given by the
following equation.

$$DIFS = SIFS + 2 \times T_{slot} = 20 + 2 \times 40 = 20 + 80 = 100\ \mu s \tag{15}$$

The back-off time can be calculated as

$$\text{Back-off time} = T_{slot} \times \mathrm{random}(CW) = 40 \times 31 = 1240\ \mu s$$

Fig. 5. Back-off time [5].

According to tables 1, 2, 3 and 4 taken from our implemented 
scenario and by using equation (14) we calculate the total time for 
packet transmission. 

Table 1. Medium access control header

Frame control | Duration  | Address 1 | Address 2 | Address 3 | Sequence control
(2 bytes)     | (2 bytes) | (6 bytes) | (6 bytes) | (6 bytes) | (2 bytes)


Table 2. Medium access control data unit

MAC header | Frame body   | FCS
(24 bytes) | (2312 bytes) | (4 bytes)


Table 3. ACK frame

Frame control | Duration  | Receiver address | FCS
(2 bytes)     | (2 bytes) | (6 bytes)        | (4 bytes)


Table 4. Physical layer data frame

Preamble   | Header    | MAC data unit
(144 bits) | (48 bits) | XXX


$$T_{total} = DIFS + \text{Back-off time} + \big(\text{Data (bytes)} + 28\ \text{(bytes)}\big) \times \frac{8}{\text{Data rate (bit/s)}} + SIFS + \text{Overhead time} + \text{ACK time}$$

$$T_{total} = 100 + 1240 + (2312 + 28) \times \frac{8}{11} + 20 + \frac{144 + 48}{11} + 14 \times \frac{8}{11}$$

$$= 100 + 1240 + 20 + \frac{18720}{11} + \frac{192}{11} + \frac{112}{11}$$

$$= 1360 + \frac{19024}{11} \tag{17}$$

$$= 3089.45\ \mu s$$
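
The packet transmission time can also be checked numerically. The sketch below is in Python rather than the Matlab program used for the figures; the function names are hypothetical, the data rate of 11 is read as Mbit/s so that all times come out in microseconds, and the SIFS term is taken as the 20 µs stated in the text.

```python
def difs_us(sifs_us: float, slot_us: float) -> float:
    """DIFS = SIFS + 2 * slot time."""
    return sifs_us + 2.0 * slot_us


def backoff_us(slot_us: float, cw_slots: int) -> float:
    """Back-off time = slot time * number of back-off slots drawn from the window."""
    return slot_us * cw_slots


def total_time_us(data_bytes: int, rate_mbps: float,
                  sifs_us: float = 20.0, slot_us: float = 40.0,
                  cw_slots: int = 31, mac_overhead_bytes: int = 28,
                  phy_overhead_bits: int = 144 + 48, ack_bytes: int = 14) -> float:
    """Total time for one packet exchange, following the structure of equation (14).

    With the rate in Mbit/s, bits divided by rate gives microseconds.
    Defaults mirror the values quoted in the text and in Tables 1 to 4.
    """
    payload = (data_bytes + mac_overhead_bytes) * 8.0 / rate_mbps   # MAC frame
    overhead = phy_overhead_bits / rate_mbps                        # preamble + header
    ack = ack_bytes * 8.0 / rate_mbps                               # ACK frame
    return (difs_us(sifs_us, slot_us) + backoff_us(slot_us, cw_slots)
            + payload + sifs_us + overhead + ack)


if __name__ == "__main__":
    for size in (256, 1024, 2312):
        print(f"{size:5d} bytes -> {total_time_us(size, 11.0):8.2f} us")
```

Because DIFS, back-off and SIFS are fixed costs, small frames spend proportionally more of the total time on overhead than large ones.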

By applying the algorithm implemented in Matlab, as shown in
Figs. 6, 7 and 8, we can calculate the total time for packet
transmission and for data sent.

Fig. 7. Total time for packet transmission

Fig. 8. Total time for Data sent

c. Designing an algorithm for the Mobility 

Because UAVs are classified as part of an ad hoc network,
which is characterized by high mobility, our algorithm depends
on the node velocity $V_n$, the tuning parameter $\alpha$ used to
vary the randomness, the mean value $\mu$ of $V_n$ as $n$ goes to
$\infty$, and a random variable $x_{n-1}$. The main point is that
each UAV is initialized with a speed and direction, and at fixed
intervals of time the UAVs update their speed and direction.



Fig. 9. Gauss-Markov Mobility Model 


In the implemented algorithm, the values of speed and direction at
the $n$-th instance of time are calculated from the values of speed
and direction at the $(n-1)$-th instance and a random variable, as
shown in Fig. 9, which represents the Gauss-Markov mobility
model. From the above parameters we can calculate the node
velocity from equation (18).

$$V_n = \alpha V_{n-1} + (1 - \alpha)\mu + \sqrt{1 - \alpha^2}\; x_{n-1} \tag{18}$$

Our algorithm splits equation (18) into three cases, from which we
can decide whether the motion is representative of UAVs or not:

The first case: if $\alpha = 0$, we obtain random motion, and
equation (18) becomes

$$V_n = \mu + x_{n-1} \tag{19}$$

The second case: if $\alpha = 1$, we obtain linear motion, and
equation (18) becomes

$$V_n = V_{n-1} \tag{20}$$

The third case: if $0 < \alpha < 1$, we obtain an intermediate level
of randomness, which represents the UAVs. So if $\alpha$ is set to a
value greater than zero and less than one, we represent UAV
motion, as shown in Fig. 11.

The last point in this algorithm is to compute the displacement
$S_n$ of a node at step $n$ from the node velocities $V_i$:

$$S_n = \sum_{i=0}^{n-1} V_i \tag{21}$$

This equation helps us to study UAV movement.
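
The velocity update and the displacement sum can be simulated in a few lines. The following Python sketch is illustrative only: the function names are hypothetical, the random variable is assumed to be zero-mean Gaussian with a chosen standard deviation, and a single scalar velocity is tracked instead of separate speed and direction.

```python
import math
import random
from typing import List, Optional


def gauss_markov_velocities(v0: float, alpha: float, mean: float, sigma: float,
                            steps: int,
                            rng: Optional[random.Random] = None) -> List[float]:
    """Generate V_0 .. V_steps with V_n = a*V_(n-1) + (1-a)*mean + sqrt(1-a^2)*x_(n-1).

    x is drawn from a zero-mean Gaussian with standard deviation sigma (assumption).
    alpha = 0 gives memoryless random motion, alpha = 1 gives constant velocity.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = rng or random.Random(0)          # fixed seed keeps the sketch reproducible
    velocities = [v0]
    for _ in range(steps):
        x = rng.gauss(0.0, sigma)
        velocities.append(alpha * velocities[-1]
                          + (1.0 - alpha) * mean
                          + math.sqrt(1.0 - alpha ** 2) * x)
    return velocities


def displacement(velocities: List[float], n: int) -> float:
    """S_n = sum of V_i for i = 0 .. n-1 (unit time step assumed)."""
    return sum(velocities[:n])


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        v = gauss_markov_velocities(v0=20.0, alpha=alpha, mean=15.0,
                                    sigma=2.0, steps=5)
        print(alpha, [round(x, 2) for x in v], round(displacement(v, 5), 2))
```

Running the three values of alpha side by side shows the behaviour of the three cases: uncorrelated samples around the mean, a smoothed intermediate trajectory, and a constant velocity.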



Fig. 10. Random waypoint mobility

Fig. 11. Velocity representation for UAVs

d. Designing an algorithm for UAV connectivity

Designing an algorithm for UAV connectivity leads us to
represent the survivability of the UAV network. The most
important issue in the world of UAVs is how to keep the
uplink and downlink between the UAV and its base station
always up (connected); we call this the survivability of the
UAV network. Our algorithm is based on some assumptions
that depend on the proposed scenario, as follows:

1) The status of a node in this network can only be damaged
or undamaged.

2) The connection between two nodes is wireless, based on
mobile radio communication.

3) Only one node is destroyed or removed each time, and
this node is the most important one in the network, which
means the worst case occurs every time.

Our algorithm is based on the estimation of network
connectivity in a complete destruction process, meaning that
the network connectivity is summed as nodes are removed
one by one until the network becomes disconnected.

Assuming the number of nodes in a given network $F$ is $n$, the
survivability measure (SM) and the connectivity measure
(CM) for this network are defined as:

$$SM(F) = \sum_{k=0}^{m-1} CM(k) \tag{22}$$

where $CM(0)$ is the connectivity measure of network $F$, and
$CM(k)$ is the connectivity measure of network $F_k$, which was
produced by removing the most important node from the
network $F_{k-1}$.

Here $k = 1, 2, \ldots, m-1$, and $m$ is the number of nodes
which have to be removed before the network becomes
totally disconnected.


The connectivity measure of network $F_k$ is given by:

$$CM(k) = \sum_{i=1}^{n-k-1} \sum_{j=i+1}^{n-k} NC_k(i,j) \tag{23}$$

where $NC_k(i,j)$ is the node connectivity between nodes $i$ and
$j$ in the network $F_k$, and $(n-k)$ is the number of nodes in
network $F_k$.

The node connectivity $NC_k(i,j)$ can be defined as:

$$NC_k(i,j) = \sum_{t=1}^{x} \frac{1}{JN(t)} \tag{24}$$


where $x$ is the number of independent paths between nodes
$i$ and $j$; more precisely, independent paths are paths that
have no common node between them [9].

$JN(t)$ is the number of jumps along the $t$-th independent path
between nodes $i$ and $j$. As shown in Fig. 12, the network
has three independent paths between nodes $i$ and $j$, thus we
have $x = 3$.

Path 1: i - j, which gives $JN(1) = 1$

Path 2: i - e - j, which gives $JN(2) = 2$

Path 3: i - a - b - c - d - j, which gives $JN(3) = 5$

and from equation (24) we get
$NC_k(i,j) = \frac{1}{1} + \frac{1}{2} + \frac{1}{5}$.

The node connectivity (NC) between nodes $i$ and $j$ depends not
only on the number of independent paths between them but
also on the number of jumps along each path. If there is a direct
link between nodes $i$ and $j$ (as in path 1), its contribution to
$NC_k(i,j)$ is 1, which also means that the survivability of this
link is 100% (it is assumed that the link cannot be damaged). On
the other hand, the contribution of path 2 or path 3 to
$NC_k(i,j)$ is less than 1, because it is possible for these paths to
be destroyed. From the above we can say that the more jumps a
path has, the smaller its contribution to $NC_k(i,j)$.
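
The connectivity quantities above reduce to simple sums once the independent paths are known. The Python sketch below is a hedged illustration: the function names and the dictionary layout are assumptions, the node connectivity is taken as the sum of reciprocal jump counts (consistent with the worked example of paths with 1, 2 and 5 jumps), and finding the independent paths in a real topology is left out.

```python
from fractions import Fraction
from typing import Dict, Iterable, List, Tuple


def node_connectivity(jumps_per_path: Iterable[int]) -> Fraction:
    """Sum of 1/JN(t) over the independent paths between one pair of nodes."""
    return sum((Fraction(1, jumps) for jumps in jumps_per_path), Fraction(0))


def connectivity_measure(paths: Dict[Tuple[str, str], List[int]]) -> Fraction:
    """CM(k): node connectivity summed over every node pair of the network F_k.

    `paths` maps each unordered pair (i, j) to the jump counts of its
    independent paths; pairs with no path contribute zero.
    """
    return sum((node_connectivity(j) for j in paths.values()), Fraction(0))


def survivability_measure(cm_values: Iterable[Fraction]) -> Fraction:
    """SM(F): sum of CM(0) .. CM(m-1) over the destruction process."""
    return sum(cm_values, Fraction(0))


if __name__ == "__main__":
    # The three independent paths of the worked example: 1, 2 and 5 jumps.
    nc = node_connectivity([1, 2, 5])
    print(nc, float(nc))

    # A toy destruction process with made-up path data for two stages.
    stage0 = {("i", "j"): [1, 2, 5], ("i", "e"): [1], ("e", "j"): [1]}
    stage1 = {("i", "j"): [1, 5]}          # after removing node e
    cms = [connectivity_measure(stage0), connectivity_measure(stage1)]
    print(cms, survivability_measure(cms))
```

Exact fractions are used so that small contributions from long paths are not lost to rounding when many pairs are summed.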


e. Cloud model 

Our cloud model, as mentioned before, is a simple form of
private cloud that relies on the military network instead of the
internet. The cloud is designed as a multi-server system with a
queuing model. The model consists of a single entry point, the
entering server (ES), which acts as a load balancer whose main
function is to forward the user requests (from a military unit) to
one of the processing server nodes $PS_i$, where $i = 1, \ldots, m$,
as shown in Fig. 13. The load balancer is represented by an
M/M/1 queue with arrival rate $\lambda$ and service rate $L$, where
$\lambda < L$ [6]. Each $PS_i$ node is a core node or processor
representing the physical computational resources where the
services are computed. The $PS_i$ nodes are modeled as an M/M/m
queuing system in which each node has service rate $\mu$, so that
$\mu = \mu_1 = \mu_2 = \cdots = \mu_m$.

Each $PS_i$ node connects to the data server (DS) with probability
$\delta$. The DS represents the directories and databases that
access secondary memory during the service in the cloud model.
The DS is modeled by an M/M/1 queue with exponential arrival
and service rates of $\delta\gamma$ and $D$, respectively [6].

The output server (OS) of the cloud architecture is represented
as a node that transmits the response data over the military
network back to the military unit, in other words the user
who made the request. We call the military unit or user the
client server (CS); it sends requests with an exponential
distribution of parameter $\lambda$ to the entering server (ES).
Both the OS and the CS are modeled by M/M/1 queues. Our main
goal is to compute the response time ($T$) of our cloud model as
follows:

$$T = T_{ES} + T_{PS} + T_{DS} + T_{OS} + T_{CS} \tag{25}$$

where $T_{ES}$, the response time of the entering server (ES),
can be calculated as [7]:

$$T_{ES} = \frac{1/L}{1 - \lambda/L} \tag{26}$$

where $\lambda$ is the arrival rate and $L$ is the service rate of
the ES. $T_{PS}$, the response time of the processing server
nodes, can be calculated as [8]:

$$T_{PS} = \frac{1}{\mu} + \frac{C(m,\rho)}{m\mu - \gamma} \tag{27}$$

where $m$ is the number of processing elements, $\gamma$ is the
arrival rate and $\mu = \mu_1 = \mu_2 = \cdots = \mu_m$ is the
service rate of each processing element. The term $C(m,\rho)$
represents Erlang's C formula [7], which gives the probability
of a new client joining the M/M/m queue and can be written
in the following form:

$$C(m,\rho) = \frac{\dfrac{(m\rho)^m}{m!} \cdot \dfrac{1}{1-\rho}}{\displaystyle\sum_{k=0}^{m-1} \frac{(m\rho)^k}{k!} + \frac{(m\rho)^m}{m!} \cdot \frac{1}{1-\rho}}, \quad \text{where } \rho = \frac{\gamma}{m\mu} \tag{28}$$

The response time $T_{DS}$ of the database server, to which
requests are sent with probability $\delta$, can be calculated in
the form of:

$$T_{DS} = \frac{1/D}{1 - \delta\gamma/D} \tag{29}$$

where $\delta\gamma$ is the arrival rate to the DS, $\delta$
represents the output probability shown in Fig. 14 and $D$ is the
service rate.







Fig. 14. Model of the cloud architecture [6].

We have to notice here the arrival rate at the output server (OS) 
as the sum of the arrival rates of the two cross point branches 
entering the sum point represented in Fig. 14. 

(1 - S)y + $Y= Y (30) 


And to complete equation (25) we have to compute 1 os 

T T 

and CS ? the OS which represent the response time of the 
output server (OS) can be computed as: 

^ O 


T OS~ , _ y 
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’/(° 


ip) 


Tqs ~ 


O-yF 


(31) 


Where is the service rate of the output server, and deeply 

O is the average band width speed (bytes per second) of the OS 
and F is the averaged size of the data responses of the system. 

Finally, $T_{CS}$, the response time of the client server (CS), can 
be calculated in the form of: 

$$T_{CS} = \frac{F/C}{1 - \gamma F/C} = \frac{F}{C - \gamma F} \tag{32}$$

Where $C/F$ is the service rate, $C$ represents the average bandwidth 
of CS in bytes per second, and $F$ is the average size in bytes 
of the received files. 
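
The response-time model of equations (25) to (32) can be evaluated numerically. The following is a minimal Python sketch, not the authors' implementation: the function names and the example parameter values are illustrative assumptions, and each M/M/1 node uses the standard identity (1/s)/(1 - a/s) = 1/(s - a) for service rate s and arrival rate a.

```python
from math import factorial


def mm1_response(service_rate, arrival_rate):
    """Mean response time of an M/M/1 queue: 1 / (service - arrival)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)


def erlang_c(m, rho):
    """Erlang's C formula: probability that an arriving client must queue."""
    if not 0 <= rho < 1:
        raise ValueError("utilisation rho must satisfy 0 <= rho < 1")
    a = m * rho
    tail = a ** m / factorial(m) / (1.0 - rho)
    head = sum(a ** k / factorial(k) for k in range(m))
    return tail / (head + tail)


def mmm_response(m, mu, gamma):
    """Mean response time of an M/M/m queue (processing node, eq. 27)."""
    rho = gamma / (m * mu)
    return 1.0 / mu + erlang_c(m, rho) / (m * mu - gamma)


def total_response_time(lam, L, m, mu, gamma, delta, D, O, C_bw, F):
    """Sum of the five node response times (eq. 25).

    C_bw is the client-server bandwidth C of eq. (32); it is renamed here
    only to avoid confusion with Erlang's C.
    """
    t_es = mm1_response(L, lam)            # eq. (26)
    t_ps = mmm_response(m, mu, gamma)      # eq. (27)
    t_ds = mm1_response(D, delta * gamma)  # eq. (29)
    t_os = mm1_response(O / F, gamma)      # eq. (31)
    t_cs = mm1_response(C_bw / F, gamma)   # eq. (32)
    return t_es + t_ps + t_ds + t_os + t_cs


# Hypothetical parameter values, chosen only to exercise the formulas.
example_total = total_response_time(
    lam=1.0, L=2.0, m=2, mu=2.0, gamma=1.0,
    delta=0.5, D=2.0, O=4.0, C_bw=4.0, F=1.0,
)
```

With m = 1 the M/M/m expression reduces to the M/M/1 result, which gives a quick sanity check on the Erlang C term.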


III. Results & Conclusion 

In this paper we built a five-step algorithm, as described in 
Section II. The first step designs an algorithm for the 
communication between UAVs that accounts for the effect of the 
wireless link between the transmitter and the receiver. This led us 
to calculate the average power and the free space path loss and, 
from these, to put the first algorithm for the communication 
between the UAVs into its final form, given in equation (13). 
Fig. 2 plots the resulting relation between distance and free space 
path loss for the UAVs: the free space path loss grows as the 
distance between the UAVs grows. 

The second step designs an algorithm for packet transmission 
time between the UAVs. It gives the total packet transmission 
time shown in Fig. 7 and the total time for the data sent shown 
in Fig. 8. From this algorithm we found that the amount and the 
type of data affect the time needed for transmission; in our 
future work we will study this effect on both the Transmission 
Control Protocol and the User Datagram Protocol. 

The third step designs an algorithm for UAV mobility, 
which showed that the tuning parameter α must vary between 
0 and 1 (0 < α < 1) to obtain an intermediate level of 
randomness, which is the closest representation of the UAVs, 
as shown in Fig. 11. 

The fourth step designs an algorithm for UAV 
connectivity, which leads to equation (24); from it we can 
decide whether or not the network has good connectivity by 
means of survivability. 

The last step is for the cloud model by calculating the 
response time of the cloud in our system to show the real 
time for data to be transmitted. 

These five steps result in a dependable algorithm for our 
system, and this algorithm can also be used as a guide for any 
similar system. 

Abstract — The smartphone is one of the important assets of 
today’s generation; it makes people more responsive, 
productive and effective in work and in personal dealings. 
Notably, it is used as the primary repository of an 
individual's confidential files because of its portability and 
reliability, which prompts smartphone companies to embed 
security features and users to install security applications 
freely available in the market. In various studies, facial 
recognition was rated the highest among security features. 
This study therefore aims to develop a facial recognition 
application specifically for Android phones, using the local 
binary pattern histogram algorithm and the V-Model to guide 
the development of the application. The application was 
tested and evaluated by experts, with a weighted mean score 
of 4.59, “Excellent”, based on its functionality, reliability, 
usability, efficiency and portability. 

Index Terms — face Recognition, applocker, identity 
theft, security on smartphones. 


I. INTRODUCTION 

Security is about the confidentiality, availability, and integrity of 
data [24], and these must be protected with reliable solutions, 
since information is now kept on various platforms that offer 
data-keeping capability. The majority of users nowadays prefer 
to keep data on their smartphones. The smartphone is one of 
the important assets of the 21st century [44], and this has led to 
the fast development of open smartphone platforms [5] [47] 
designed with multi-layered security [36]. The evolution of 
technology and the research conducted to strengthen existing 
security solutions have produced innovative applications [15] 
[35] [21]. Notably, various security mechanisms are embedded 
in and integrated into smartphones, such as patterns, PINs, 
passwords and face recognition; among these, face recognition 
ranks best according to recent research [14] [21], and it is an 
emerging field of research. There are also security 
applications freely available in the market [17]. Moreover, a 
security application should have a strong security architecture 
and a meticulous security program [20], and it must be reliable 
and accurate [23] to ensure that vital information is secured 
[34] [2] and to protect individual safety and individual property 
[36]. Smartphones are reported to face an array of threats that 
take advantage of numerous vulnerabilities [19]. These 
vulnerabilities can be the result of inadequate technical 
controls, but they can also result from poor security 
solutions [27]. Current research on face recognition applies 
different techniques and algorithms to achieve recognition 
rates that cope with complex variation, so that recognition is 
reliable and accurate and can be used in various applications 
[8], such as mobile phone protection, where it is used to 
unlock the device [39]. Moreover, face recognition reduces the 
risk of forgetting passwords and speeds up authentication [34]. 
It also provides a strong mechanism to authenticate the unique 
features of authorized users [25], [18], [51]. In this study, a 
Local Binary Pattern algorithm is applied for face recognition 
for its simplicity and efficiency [32], [42], [45]. 

A. Project Context 

An estimated 3 billion smartphone users are expected in 2016, 
at least 72% of them Android users [4] worldwide. Asia, one of 
the leading continents in smartphone development, with 
countries such as Japan [50], South Korea [23], Singapore [39] 
and China [52], also has the largest number of Android users 
compared to the Americas, Europe and Australia [31]. The 
Philippines is considered the fastest growing smartphone 
country, with at least 35% of its population using smartphones, 
58% of whom are Android users [10]. With these numbers, a 
greater number of users experience leaks of personal 
information and sensitive data: 78% of smartphone users have 
experienced leaks of personal identity information, including 
their name, personal files, pictures and classified videos [43]. 
This raises the security of mobile phones as a public concern. 
Several causes affect the security of mobile phones. Mobile 
phones often lack passwords to authenticate users and control 
access to data stored on the devices, which increases the risk 
that the information on stolen or lost phones could be accessed 
by unauthorized users who could view sensitive information 
and misuse the devices [20]. In other cases, authentication is 
weak because of an easy pattern combination or PIN [24], 
which can be guessed, forgotten, written down and stolen, or 
eavesdropped [42]. Lastly, to avoid tracking, most users turn 
off the phone’s location tracking [50], which hinders its use as 
added security when the phone is stolen or misplaced [12]. 

Existing security applocks have an easy set of authentications, 
resulting in the same level of security that Android itself 
offers [17]. In contrast, other applocks have a difficult set of 
authentications, so users take a long time to open their phones 
and applications [50]. Furthermore, most of the applocks 
available in the market can be easily removed or uninstalled 
[11]. The poor authentication of most applocks can still 
undermine the protection the user sought when downloading 
them [20]. 

The face of a human being conveys a lot of information about 
someone’s identity [48]. To make use of its distinctiveness, 
facial recognition was developed [47]. As every person is 
unique, facial recognition offers the most secure protection 
and authentication for smartphones [46]. Face recognition is 
an interesting and challenging problem and impacts important 
applications in many areas such as identification for law 
enforcement, authentication for security system access, and 
personal identification [28]. Compared to passwords, it 
provides a distinctive print to gain access [18]. It not only 
makes stealing of passwords nearly impossible but also 
increases user-friendliness in human-computer interaction [2]. 

B. Statement of the Problem 
General Problem 

The researchers found that, other than fingerprint 
authentication, which is available only on a few models of 
Android smartphones, there is no existing high-security mobile 
application that is a reliable and accurate tool for securing 
important data and applications on mobile devices. 

Specific Problems 

• The Android smartphone’s PIN, Pattern, and Password can 
be easily determined. 

The researchers interviewed users about whether they had 
encountered password theft; according to them, most incidents 
happened when relatives or friends saw their password. To 
investigate further, the researchers surveyed 340 respondents, 
asking how often they change the password of their security 
application: six (6, 1.76%) change it regularly or every day; 
eighty-four (84, 24.71%) every week; one hundred forty-five 
(145, 42.65%) every month; thirty-six (36, 10.59%) every 
year; and thirty-nine (39, 11.47%) rarely change it. The 
largest share of respondents change their password (PIN, 
Pattern, or Password) monthly, which suggests that they are not 
satisfied with their security and feel the need to change it often. 

• Lack of higher security to protect applications from 
unauthorized users on almost all Android smartphones, from 
the lowest to the highest model. 

Most people nowadays keep important files on their 
devices. In the survey on which security application Android 
users use, eighty-two (82 or 24.12%) use Smart Lock; 
one hundred ten (110 or 32.35%) use CM Lock; one hundred 
twenty-four (124 or 36.47%) use Applocker; fourteen (14 or 
4.12%) use Finger security; one (1 or 0.29%) uses Privacy 
Knight; and three (3 or 2.65%) do not use any other 
application. The largest group uses Applocker, which means 
that the researchers need to pay more attention to securing 
applications on Android smartphones, since other security 
apps lack security for individual apps. 

C. Research Objectives 
General Objective 

To develop an Android application that uses an existing 
security tool, facial recognition, to protect selected 
applications on Android smartphones. 

Specific Objectives 

• To develop an application that applies Face Recognition 
Security using Local Binary Pattern Algorithm. 


The system will use face recognition, a stronger security 
method than traditional methods such as pattern, PIN, and 
password. 

First, the user installs the APK of the system and opens it to 
set the security. Second, the system asks to “Activate Device 
Administrator”; the user clicks “Activate”. It then proceeds to 
detect the face and save it as the security credential. 
Afterwards, it asks for a PIN to register as a backup, used as a 
secondary method when face recognition is not feasible in the 
current environment. 

• To develop a system that will secure all applications using 
the Start Service and Block method. 

The system secures the applications that the user 
selects. It can “Start Service” and “Block” individual 
applications from opening, both built-in applications and 
downloaded applications. After the security is set, the system 
is ready to use. The user only needs to select, from the list of 
installed applications shown in the system, those to secure, 
and then click “save”. Every selected application will require 
face recognition before it opens. If the face does not match 
the one in the database within 3 attempts, the application is 
forced to close. 
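
The three-attempt rule described above can be sketched in a few lines. This is only an illustration in Python of the stated behaviour, not the application's Android code: `try_unlock`, `face_matches` and the returned strings are hypothetical names, and the PIN fallback is left out because the text does not specify how it interacts with the attempt counter.

```python
MAX_ATTEMPTS = 3  # limit stated in the text


def try_unlock(face_matches, captured_faces, max_attempts=MAX_ATTEMPTS):
    """Return "open" if any of the first `max_attempts` captures matches.

    face_matches   -- predicate deciding whether a capture matches the
                      enrolled face (hypothetical stand-in for the recogniser)
    captured_faces -- iterable of face captures, one per attempt
    """
    for attempt, face in enumerate(captured_faces):
        if attempt >= max_attempts:
            break
        if face_matches(face):
            return "open"
    return "force_close"
```

For example, a match on the second capture opens the application, while a match that would only occur on a fourth capture is never reached and the application is closed.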

D. Scope and Limitation 
Scope 

The study focused on a face recognition app locker for Android 
smartphones. The application includes the following features: 
Security Module - This module lets the user use facial 
recognition as the password required before the selected 
applications open. A second security method, the PIN, serves 
as an alternative when facial recognition is not feasible, for 
example during a blackout. 

Application Choice Module - This module lets the user 
choose the applications to which the application will apply its 
security features. 

Image Module - This module uses OpenCV to convert the 
image to a string/array. 

Face Recognition Module - This module lets the user set 
a face as the primary security; it also scans the detected face 
and compares it before the selected applications open. 

Pin Module - This module lets the user set 4 to 6 
characters as a secondary security that can be used when 
face recognition is not feasible because of the environment. 

Limitations 

• It is unable to recognize a subject wearing sunglasses or 
when any object acts as a barrier to the distinctive facial 
features. 

• The system captures faces only at a distance of less than 
about 2 meters. 

• When a face is trained in recognition software, multiple 
angles are usually used (profile, frontal and 45-degree 
are common). Anything less than a frontal view affects 
the algorithm’s capability to generate a template for the 
face. 

E. Significance of the Study 

The results of this study will be beneficial to the following: 
Users - The study will help reduce identity theft and intrusion 
of privacy in Android smartphones. 

Application Developers - This study will serve as a reference 
for application developers in terms of security and protection 
for Android applications. 

Researchers - The study lets the researchers apply the technical 
skills they have learned in their Computer Science course. 

Future Researchers - The study can be a basis for future 
researchers. 

II. Methodology 

The researchers applied the V-model to ensure the accuracy of 
every stage of the application development process. 



Fig. 1. Verification Model 

System Requirements Specifications 
In this phase, the researchers identified the requirements 
of the proposed application. 

Investigate the present-day conditions 
The researchers surveyed three hundred forty (340) 
Android smartphone users. The survey questionnaire 
served as the reference or basis for the development of 
the solution. 

Identify the requirements 

The researchers identified the hardware and software 
requirements and the processes and techniques to be 
applied to accomplish the study. 

High Level Design 

In this phase, the researchers arranged the course of the 
proposed project, the user interface design, and the 
database design. 

Outline of the System Design 

The researchers laid down the concept of the process, 
techniques and strategies with estimated required time of 
completion. The concepts are interpreted through a 
diagram to visually see the flow of the application. 

Low Level Design 

In this phase, the researchers worked out the technical 
aspects of the application: the algorithm that performs 
face recognition, the database design, and the system 
architecture. 

Implementation 

In this phase, the application was installed and tested by 
the experts. All issues relating to the functionality of the 
application were identified and solved immediately. 

Coding 

In this phase, the researchers coded the application 
based on the outline of the requirements, and the classes 
of the modules were reviewed carefully. 

Testing 

In this phase, the researchers conducted a series of tests, 
walking through every phase of each module to ensure 
acceptability and suitability. Furthermore, the application 
was evaluated based on ISO characteristics. 


A. Algorithm 

The researchers used the Local Binary Patterns algorithm to 
analyze face images in terms of shape and texture. The 
face area is divided into small regions, from which binary 
patterns computed over the pixels are extracted and 
concatenated into a single feature vector, so that the 
similarity between images can be measured efficiently. 


Fig. 2. Local Binary Patterns Algorithm 
The algorithm consists of binary patterns that describe 
the surroundings of pixels in the regions. The features obtained 
from the regions are concatenated into a single feature histogram, 
which forms a representation of the image. Images can then be 
compared by measuring the similarity (distance) between their 
histograms. Because of the way the texture and shape of images 
are described, the method seems to be quite robust against face 
images with different facial expressions, different lighting 
conditions, image rotation and aging of persons. 
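
The description above can be illustrated with a small pure-Python sketch of the basic 3x3 Local Binary Pattern operator, region-wise histograms, and a histogram distance. It is a simplified illustration and not the F-Locker implementation, which relies on OpenCV: the function names, the 2x2 region grid, the bit ordering of the neighbours and the chi-square distance are all assumptions made for the example.

```python
# Neighbour offsets (row, col), clockwise from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c): one bit per neighbour >= centre."""
    centre = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(img, grid=(2, 2)):
    """Concatenate the 256-bin LBP histograms of each region of the image."""
    rows, cols = len(img), len(img[0])
    # Border pixels are skipped because they lack a full 3x3 neighbourhood.
    codes = [[lbp_code(img, r, c) for c in range(1, cols - 1)]
             for r in range(1, rows - 1)]
    h, w = len(codes), len(codes[0])
    grid_rows, grid_cols = grid
    feature = []
    for i in range(grid_rows):
        for j in range(grid_cols):
            hist = [0] * 256
            for r in range(i * h // grid_rows, (i + 1) * h // grid_rows):
                for c in range(j * w // grid_cols, (j + 1) * w // grid_cols):
                    hist[codes[r][c]] += 1
            feature.extend(hist)
    return feature


def chi_square(h1, h2):
    """Chi-square distance between two histograms (0 means identical)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


# Two tiny synthetic 6x6 "images": a flat patch and a diagonal gradient.
flat = [[10] * 6 for _ in range(6)]
gradient = [[r * 6 + c for c in range(6)] for r in range(6)]
distance = chi_square(lbp_histogram(flat), lbp_histogram(gradient))
```

A recogniser built this way would compare the histogram of a new capture with the stored histograms and accept the closest one only if the distance falls below a chosen threshold; the threshold is a tuning choice that the text does not specify.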




III. Results and Discussion 
The researchers specified the requirements needed: 

TABLE I 
System Requirements 

Requirement                        Specification 
Android Operating System Version   Minimum: Lollipop; Maximum: Nougat 
Android Database                   SQLite 
Android Programming Software       Android Studio 

The minimum Android version required by F-Locker is Lollipop, 
since this is the lowest version capable of face recognition, 
while the maximum version is Nougat. 


Fig. 5 shows the application detecting the face image by 
applying the Local Binary Pattern Histogram algorithm. Once 
the face image is detected, it is compared against the training 
data sets stored in the application. Once matched, the 
application opens automatically. 



Fig. 3. Conceptual Framework 



Fig. 3 shows the conceptual framework of F-Locker. 
OpenCV is used for the face detection that captures the image, 
which is then converted from an image to a string/array and 
saved. The detection of the face in pixels depends on how 
much of the face area was captured. The face image stored in 
the database serves as the training data set and is used as the 
reference against which a newly detected face image is 
analyzed and compared. The locked application(s) will be 
unlocked if and only if the new image matches the stored face 
image. 


[Figure: system architecture diagram; recoverable labels: User, Face Recognition, Android, SQL Database (SQLite)] 

Fig. 4. System Architecture 

Fig. 4 shows the system architecture of the application. The 
user trains the application on his or her face, which is saved as 
the password for the applications chosen to be locked. Each 
nodal point that the system extracts from the face guides the 
application in deciding whether or not it is the user. The face of 
every user who trains is saved to the application's database 
(SQLite). The system is limited to Android versions from 
Lollipop to Nougat. 

The F-Locker Application Interface 






Fig. 6. PIN Password 

Fig. 6 shows the PIN password used as an alternative security 
method during a blackout or when the surrounding 
environment is dark. 



Fig. 7. Applications 

Fig. 7 shows all the applications installed on a phone. These 
applications can be locked by selecting the key icon beside them. 



Fig. 8. Face Verification 

Fig. 8 shows a notification that the detected face cannot access 
the application, which means that the face image does not 
match the training data sets. 


Fig. 5. Face Detection 

TABLE II 
Software Evaluation Survey Results 

Characteristic        Mean   Verbal Interpretation 
Functionality         4.63   Very Satisfactory 
Reliability           4.64   Very Satisfactory 
Usability             4.52   Very Satisfactory 
Efficiency            4.62   Very Satisfactory 
Portability           4.57   Very Satisfactory 
Total Weighted Mean   4.60   Very Satisfactory 


The researchers submitted the application to experts for 
software evaluation using ISO 9126. The application obtained 
an overall weighted mean of 4.60, “Very Satisfactory”, which 
means that the application meets most of the specified 
requirements: Functionality obtained a weighted mean of 
4.63, Reliability 4.64, Usability 4.52, Efficiency 4.62, and 
Portability 4.57, each interpreted as “Very Satisfactory”. 
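
The overall figure can be checked from the five characteristic means. The short Python sketch below assumes that the five characteristics are weighted equally, which the text does not state explicitly; under that assumption the average is 4.596, which rounds to the reported 4.60.

```python
# Characteristic means from Table II (equal weights are an assumption).
scores = {
    "Functionality": 4.63,
    "Reliability": 4.64,
    "Usability": 4.52,
    "Efficiency": 4.62,
    "Portability": 4.57,
}

overall = sum(scores.values()) / len(scores)
print(f"{overall:.3f} -> {overall:.2f}")
```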

IV. Conclusion 

The researchers concluded that the developed system offers 
enough convenience, suitability, and ease for users to have 
higher security on their Android phones. 

The application offers reliability in reducing identity theft and 
data intrusion for each application installed on the user's 
Android smartphone. 

V. Recommendation 

The researchers recommend the following features to 
improve the application. 

• Consider resolving complexity issues such as hats and 
different kinds of eyeglasses, and environmental issues such 
as lighting conditions. 

• Use an algorithm that detects the face from all angles. 

• Employ more security features. 

ABSTRACT 

Vector quantization (VQ) is a powerful technique in the field of digital image compression. The generalized 
residual codebook is used to remove distortion in the reconstructed image and so further enhance the quality of the 
image. Generalized Residual Vector Quantization (GRVQ) has previously been optimized by Particle Swarm Optimization (PSO) 
and Honey Bee Mating Optimization (HBMO). However, the performance of GRVQ degrades because the PSO algorithm 
converges unstably when the particle velocity is high, and the performance of the HBMO algorithm depends on many 
parameters that must be tuned to reduce the size of the codebook. In this paper, therefore, the Artificial Plant Optimization 
Algorithm (APOA) is used to optimize the parameters used in GRVQ. Extensive experiments demonstrate that the 
proposed APOA-GRVQ algorithm outperforms the existing algorithms in terms of quantization accuracy and computation 
accuracy. 

Keywords: Vector Quantization, Compression, APOA, GRVQ, PSO and HBMO. 


INTRODUCTION 

Storing and transferring large amounts of digital 
image data is a challenging issue today, because 
uncompressed data occupies a large amount of storage 
space and transmission bandwidth. Image compression 
maps a higher dimensional space into a lower dimensional 
space. It is categorized as lossy or lossless (Chen, S. X., 
and Li, F. W 2012). Lossless compression recovers the 
original image completely. It is used in the medical field, 
whereas lossy compression is used for natural images and 
other applications where minor degradation is acceptable 
in exchange for a significant decrease in bit rate. This 
paper focuses on a lossy compression technique using VQ 
for compressing images. 

VQ is a lossy data compression method based on 
the principle of block coding (Horng, M. H 2012). It is a 
fixed-to-fixed length encoding algorithm. Applying VQ to 
multimedia is a challenging problem because 
multi-dimensional data must be handled. In 1980, Linde, 
Buzo, and Gray (LBG) proposed a VQ algorithm based on 
training sequences (Chen, Y., et al. 2010). The use of 
training sequences bypasses the need for multi-dimensional 
integration. In VQ, the distance among blocks is found 
with an extra fix (Babenko, A., and Lempitsky, V 2014). 
GRVQ removes this extra fix by introducing regularization 
in the codebook learning phase (Shicong Liu et al. 2017). 
GRVQ reduces the complexity of the VQ methods. The 
main idea of GRVQ is to iteratively select a codebook, 
optimize it with the current residual vectors, and then 
re-quantize the dataset to obtain the new residual vectors 
for the next iteration. 
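
The role of residual vectors can be illustrated with plain multi-stage residual quantization. The Python sketch below is a simplified illustration with fixed, hand-written codebooks: it shows only how each stage quantizes the residual left by the previous one, and it does not implement GRVQ's codebook learning, its regularization, or the APOA optimization. All names and values are hypothetical.

```python
def nearest(codebook, vector):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vector)),
    )


def residual_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the residual."""
    codes, residual = [], list(vector)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def residual_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks for 2-D vectors.
codebooks = [
    [[0, 0], [10, 10]],  # coarse stage
    [[0, 0], [1, -1]],   # refinement stage applied to the residual
]
codes, residual = residual_encode([11, 9], codebooks)
reconstruction = residual_decode(codes, codebooks)
```

In this toy example the second stage removes the error left by the first, so the reconstruction equals the input; with learned codebooks the final residual is generally nonzero and is the distortion the optimization tries to reduce.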

GRVQ has been optimized using the PSO and 
HBMO algorithms (Divya, A., and Sukumaran, S 2017). 
The performance of PSO degrades when the particle 
velocity is high, because convergence becomes unstable. 
The performance of the HBMO algorithm depends on 
several parameters, and many independent parameters must 
be tuned to design an efficient codebook, which increases 
the complexity. To further improve GRVQ, this paper 
presents APOA to optimize the quantization accuracy and 
computation efficiency. 


Section II discusses in detail the various research 
methodologies that are to be evaluated. Section III details 
the proposed methodology, and Section IV discusses the 
results of the proposed and existing methodologies. 
Finally, Section V presents the conclusion of the research 
work. 

LITERATURE SURVEY 

Horng et al. [HOR11] presented a novel method 
based on the Honey Bee Mating Optimization technique to 
enhance the performance of the LBG compression 
technique. The method finds the optimal result from the 
training data and constructs the codebook based on vector 
quantization. Its performance was compared with the 
LBG, PSO-LBG and QPSO-LBG algorithms. The results 
show that HBMO-LBG was more reliable and that its 
reconstructed images had higher quality than those of all 
the other algorithms. 

Yang et al. [YAN09] introduced the Cuckoo 
Search (CS) optimization algorithm to solve optimization 
problems. CS was compared with the genetic algorithm 
and the particle swarm algorithm; the results show that CS 
was superior for multimodal objective functions. 
Moreover, CS was more robust for many optimization 
problems and can easily be extended to multi-objective 
optimization applications with several constraints, even 
NP-hard problems. 

Omari et al. [OMA15] presented an approach to 
improve the quality of the reconstructed image after 
decompression. Lossy compression was applied to gain 
high compression ratios. The proposed approach reduces 
rational numbers to the non dominator form and enhances 
the efficiency of the genetic algorithm in finding better 
rational numbers with a shorter form. 

Bai et al. [BAI16] proposed an approach to 
improve the performance of vector compression. The 
Multiple Stage Residual Model (MSRM) uses residual 
vectors and improves image classification. MSRM with 
VQ was used to adjust the vector compression and 
delivered higher performance compared with traditional 
algorithms. 

Tsolakis et al. [TSO12] presented fuzzy 
clustering based vector quantization to achieve the optimal 
vector quantization result. Fuzzy Clustering Based Vector 
Quantization uses the c-means and fuzzy c-means 
algorithms to handle limitations of vector quantization 
such as dependence on initialization, high computational 
cost, and the requirement that VQ assign each training 
sample to only one cluster. The results show a statistical 
increase in performance over the classical methods, 
insensitivity to the design parameters, and reconstructed 
images that maintain high quality in terms of the 
distortion measure. 

Enireddy et al. [ENI15] presented an improved 
cuckoo search with particle swarm optimization algorithm 
to overcome limitations in medical image retrieval for 
compressed images. In this approach, the images were 
compressed using the Haar wavelet. Features were 
extracted using the Gabor filter and the Sobel edge 
detector. The extracted features were then classified using 
a partial recurrent neural network (PRNN). Finally, the 
novel particle swarm optimization (PSO)-CS was used to 
optimize the learning rate of the neural network. 

Kumar et al. [KUM14] presented a hybrid 
method, called CbABC, that integrates the crossover 
operator of the genetic algorithm into the Artificial Bee 
Colony (ABC) algorithm for continuous optimization. The 
results show that CbABC improved on the traditional ABC 
algorithm for the Travelling Salesman Problem (TSP). The 
proposed algorithm is also able to get out of a local 
minimum and can be used efficiently for separable, 
multivariable, multimodal function optimization. 

Chiranjeevi et al. [CHI16] proposed a modified 
firefly algorithm for vector quantization to improve the 
reconstructed image quality and fitness function value and 
to reduce the convergence time relative to the traditional 
firefly algorithm. The Modified Firefly Algorithm 
increases the brightness of the fireflies compared with the 
traditional algorithm to improve the fitness function, and it 
is used to generate a global codebook for efficient vector 
quantization, improving image compression. 

Liu et al. [LIU10] presented an efficient method 
for compressing encrypted greyscale images through 
Resolution Progressive Compression (RPC). The results 
show that the proposed approach has better coding 
efficiency and less computational complexity than 
traditional approaches. In this method, the encoder first 
sends a downsampled version of the ciphertext; at the 
decoder, the low-resolution image is then decoded and 
decrypted. The predicted image, together with the secret 
encryption key, is then used as side information (SI) to 
decode the next resolution level. This process is iterated 
until the whole image is decoded. Moreover, by removing 
the Markovian property in Slepian-Wolf decoding, the 
complexity of decoding is reduced significantly. 

Tsai et al. [TSA13] presented fast ant colony 
optimization to handle the issue of codebook generation. 
The method rests on two observations. First, during the 
convergence process of ACO for CGP, patterns or 
sub-solutions reach their required states at different times. 
Second, most of the patterns are assigned to the same 
codewords after a certain number of iterations. Based on 
these observations, pattern reduction is enhanced and the 
computation of the Ant Colony System (ACS) for the 
Codebook Generation Problem (CGP) is sped up. The 
results show that Fast Ant Colony Optimization reduces 
the computation time of ACS for CGP. 


PROPOSED METHODOLOGY 


In the proposed methodology, GRVQ is 
optimized through APOA to achieve better 
quantization accuracy and computation efficiency. 
Artificial Plant Optimization based Generalized 
Residual Vector Quantization (APOA-GRVQ) 

In the proposed methodology, Artificial Plant 
Optimization based Generalized Residual Vector 
Quantization (APOA-GRVQ) is first applied to the 
codebook. APOA optimizes the codebook based on its 
optimal fitness value. In APOA, the fitness value is 
calculated from the photosynthesis and phototropism 
functions. 

APOA is inspired by the natural growth process 
of plants. In APOA, each individual represents one 
potential branch, and several operators are applied during 
the growth period. The photosynthesis operator computes 
the energy produced from sunlight and other materials, 
while the phototropism operator guides the growing 
direction according to various conditions. Additionally, the 
apical dominance operator makes minor adjustments to 
the growth direction. 

To simulate the plant growing phenomenon, it is
important to connect the growing process with the
optimization problem. In APOA, the search space is mapped
onto the whole plant growing environment, and each
individual is treated as a virtual branch. Moreover, all
provisions such as water, carbon dioxide and other
materials are assumed to be inexhaustible, except
sunlight. Since the light intensity varies between
branches, it can be used as the fitness value of each
branch.

Photosynthesis produces the energy for branch
growth. The photosynthetic rate plays an important role
in measuring how much energy is produced. In botany, the
light response curve describes the photosynthetic rate,
and many models have been proposed in past research, such
as the rectangular hyperbolic model, the non-rectangular
hyperbolic model, the updated rectangular hyperbolic
model, the parabola model, the straight line model and
exponential curve models. In this research, the
rectangular hyperbolic model is used to measure the
energy obtained:


$$P_i(t) = \frac{\alpha \, Uf_i(t) \, P_{max}}{\alpha \, Uf_i(t) + P_{max}} - R_d \tag{1}$$


$P_i(t)$ - Photosynthetic rate of branch i at time t
$\alpha$ - Initial quantum efficiency
$P_{max}$ - Maximum photosynthetic rate
$R_d$ - Dark respiratory rate

The three parameters $\alpha$, $P_{max}$ and $R_d$
control the magnitude of the photosynthetic rate.
$Uf_i(t)$ denotes the light intensity and is defined by
Equation (2).


$$Uf_i(t) = \frac{f_{worst}(t) - f_i(t)}{f_{worst}(t) - f_{best}(t)} \tag{2}$$


Here $f_{worst}(t)$ and $f_{best}(t)$ are the worst and best
light intensities at time t, respectively, and $f_i(t)$
denotes the light intensity of branch i.
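
As a minimal sketch of Equations (1) and (2), the following Python code normalises a set of light intensities and evaluates the rectangular hyperbolic model for each branch. The function names, the `higher_is_better` flag, the handling of the case where all branches are equally lit, and the parameter values (alpha = 0.5, P_max = 2.0, R_d = 0.1) are illustrative assumptions and are not taken from this work.

```python
def light_intensity(fitness, higher_is_better=True):
    """Equation (2): map raw fitness values to [0, 1] (best -> 1, worst -> 0)."""
    f_best = max(fitness) if higher_is_better else min(fitness)
    f_worst = min(fitness) if higher_is_better else max(fitness)
    span = f_worst - f_best
    if span == 0:
        # Equation (2) is undefined when all branches are equally lit;
        # returning 1.0 for every branch is an assumption of this sketch.
        return [1.0 for _ in fitness]
    return [(f_worst - f) / span for f in fitness]


def photosynthetic_rate(uf, alpha, p_max, r_d):
    """Equation (1): rectangular hyperbolic light response curve."""
    return alpha * uf * p_max / (alpha * uf + p_max) - r_d


# Hypothetical light intensities of three branches.
fitness = [1.0, 2.0, 3.0]
uf = light_intensity(fitness)
rates = [photosynthetic_rate(u, alpha=0.5, p_max=2.0, r_d=0.1) for u in uf]
```

The worst branch receives no light, so its rate reduces to the negative dark respiratory rate, while the best branch approaches the saturation level set by P_max.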

Phototropism is directional growth in which the
direction of growth is determined by the direction of the
light source. In APOA, branches favor positions with
high light intensities so that they can produce more
energy, and each branch is attracted toward these
positions. Therefore, branch i grows as follows:

$$x_i(t+1) = x_i(t) + G_p \, F_i(t) \, rand() \tag{3}$$

$G_p$ is a parameter reflecting the energy
conversion rate and is used to control the growing size
per unit time. $F_i(t)$ denotes the growing force guided
by the photosynthetic rate, and rand() represents a random
number sampled from a uniform distribution.
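
The growth step of Equation (3) can be sketched in Python as follows, treating the growing force F_i(t) as an already computed vector. The function name `grow`, the use of a single random sample shared by all coordinates, and the example values are assumptions made for illustration only.

```python
import random


def grow(position, force, g_p, rng=random):
    """Equation (3): x_i(t+1) = x_i(t) + G_p * F_i(t) * rand()."""
    # One uniform sample in [0, 1) is drawn per branch and applied to every
    # coordinate; drawing a sample per coordinate would be another option.
    r = rng.random()
    return [x + g_p * f * r for x, f in zip(position, force)]


# Hypothetical branch at the origin, pulled along the first axis.
rng = random.Random(0)
new_position = grow([0.0, 0.0], [1.0, 0.0], g_p=2.0, rng=rng)
```

With a unit-length force, G_p bounds the step length, so the branch moves at most G_p along the force direction in one iteration.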

For each branch i, $F_i(t)$ is computed by

$$F_i(t) = \frac{F_i^{total}(t)}{\lVert x_i(t) - x_p(t) \rVert} \, \big( x_i(t) - x_p(t) \big) \tag{4}$$

where $\lVert \cdot \rVert$ denotes the Euclidean distance, and
$F_i^{total}(t)$ is computed in the following way:

$$F_i^{total}(t) = \sum_{p \neq i} coe \cdot e^{-dim \cdot P_i(t)} - e^{-dim \cdot P_p(t)} \tag{5}$$


Here dim represents the problem dimensionality, and
coe is the parameter used to control the direction:

$$coe = \begin{cases} 1 & \text{if } P_i(t) > P_p(t) \\ -1 & \text{if } P_i(t) < P_p(t) \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

Moreover, a small probability $P_b$ is introduced
to reflect the influence of random events:

$$x_i(t+1) = x_{min} + (x_{max} - x_{min}) \cdot rand_1(), \quad \text{if } rand_2() < P_b \tag{7}$$

Here $rand_1()$ and $rand_2()$ are two random numbers with
uniform distribution.

In this work, each branch is considered as an
individual codebook, and the light intensity is considered
as the fitness value. Each codebook is optimized based on
its fitness value. The following Algorithm 3.2 describes,
step by step, how the optimized APOA-GRVQ is obtained.

Algorithmic Steps for APOA-GRVQ

Initialize the APOA parameters and m branches
randomly from the problem search space, which is an
n-dimensional hypercube. Each branch is considered as an
individual codebook.

Input: Initial codebook C, number of branches B, number
of elements K per branch during the growing period. The
initial codebooks are represented as
$C_b = (c_b(1), \dots, c_b(K))$, $b \in [B]$.

Output: Optimized codebooks $\{ C_b : b \in [B] \}$

Step 1: Encode $x \rightarrow (i_1(x), i_2(x), \dots, i_B(x))$.

Step 2: For each branch i, calculate the current residual
of $x_i$ for each codebook:

$$e_{x_i} = x_i - \sum_{b=1}^{B} c_b\big(i_b(x_i)\big)$$

where $x_i$ represents the i-th input image.

Step 3: Calculate the fitness value (light intensity) for
each codebook:

$$calc(C) = \frac{1}{D(C)} = \frac{N_b}{\sum_{j=1}^{N_c} \sum_{i=1}^{N_b} u_{ij} \times \lVert x_i - c_j \rVert^2}$$

where $c_j$ is the j-th codeword of size $N_b$ in a codebook
of size $N_c$, and $u_{ij}$ is 1 if $x_i$ is in the j-th
cluster and zero otherwise.

Step 4: Initially select a codebook randomly and find its
fitness value. If there is a brighter codebook, the
selected codebook moves toward the higher light intensity
(highest fitness value) as described in the following
steps.

Step 5: Calculate photosynthesis as follows:
    for i = 1 to B do
        Compute the light intensity Uf(x_i) using Equation (2)
        Compute the photosynthetic rate p_i using Equation (1)
    end do

Step 6: Calculate phototropism:
    for i = 1 to n do
        if rand_2() < P_b
            x_i(t+1) <- x_i(t) using Equation (7)
        else
            x_i(t+1) <- x_i(t) using Equation (3)
        end if
    end do

Step 7: If the number of iterations reaches the maximum
number of iterations, stop and display the results.
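
The residual computation and the fitness evaluation in the algorithm (Steps 1 to 3) can be sketched in Python as shown below. This is only an illustration under stated assumptions: the codeword indices are chosen greedily, one codebook after another, the cluster membership u_ij is taken to mean "nearest codeword", N_b is taken as the number of input vectors in the sum, and all function names and toy values are hypothetical. The source does not specify these details.

```python
def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, v):
    """Index j of the codeword c_j closest to v."""
    return min(range(len(codebook)), key=lambda j: sq_dist(codebook[j], v))


def encode(x, codebooks):
    """Steps 1-2: choose one codeword per codebook, return (indices, residual).

    Greedy stage-by-stage selection is an assumption of this sketch.
    """
    residual = list(x)
    indices = []
    for cb in codebooks:
        j = nearest(cb, residual)
        indices.append(j)
        # After the loop: residual = x - sum_b c_b(i_b(x)).
        residual = [r - c for r, c in zip(residual, cb[j])]
    return indices, residual


def fitness(codebook, vectors):
    """Step 3: calc(C) = 1 / D(C) = N_b / (total squared distortion)."""
    # With u_ij = 1 only for the nearest codeword, the double sum reduces to
    # each vector's squared distance to its closest codeword.
    total = sum(sq_dist(v, codebook[nearest(codebook, v)]) for v in vectors)
    return float("inf") if total == 0 else len(vectors) / total


# Hypothetical toy data: two codebooks with two 2-D codewords each.
codebooks = [[[0.0, 0.0], [4.0, 4.0]], [[0.0, 0.0], [1.0, 0.0]]]
indices, residual = encode([5.0, 4.0], codebooks)
score = fitness(codebooks[0], [[0.0, 1.0], [4.0, 4.0], [1.0, 0.0]])
```

A codebook with lower distortion receives a higher fitness, which is the quantity that plays the role of light intensity when branches are compared.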



Fig. 1. Flow Chart of the Proposed Methodology 

RESULT AND DISCUSSION 

The experiments are conducted in MATLAB on three
images: Peacock, Panda and Church. The proposed
APOA-GRVQ method is compared with the LBG, Cuckoo-LBG,
PSO-GRVQ and HBMO-GRVQ methods.
Fig. 2. Comparison results of reconstructed image

Compression Ratio ($C_R$)

The compression ratio measures the data
compression ability as the ratio between the original
image ($C_1$) and the compressed image ($C_2$). It is
defined by

$$C_R(\%) = \frac{C_1}{C_2} \tag{8}$$

Table 1. Comparison of Compression Ratio for
Peacock, Panda and Church images

Images /   Existing                                      Proposed
Method     LBG       CUCKOO-LBG   PSO-GRVQ   HBMO-GRVQ   APOA-GRVQ
Peacock    27.9561   28.8594      29.8261    33.8256     41.8846
Panda      26.7577   28.1826      36.4846    38.5059     45.4231
Church     28.2912   29.1586      39.7180    40.1846     44.2142


Fig. 3. Comparison Result of Compression Ratio

Figure 3 compares the proposed APOA-GRVQ
technique with the existing LBG, Cuckoo-LBG, PSO-GRVQ
and HBMO-GRVQ methods in terms of compression ratio.
The X axis shows the compression methods and the Y axis
shows the compression ratio in percentage. The bar chart
shows that the proposed APOA-GRVQ technique gives the
highest compression ratio for the Peacock, Panda and
Church images.

Structure Similarity 

The structural similarity measure is based on the
human visual system and combines structure, luminance
and contrast information to assess the visual quality of
the decompressed image.
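
The text does not give a formula for structural similarity. As a hedged illustration, the sketch below assumes the widely used SSIM definition with stabilising constants C1 = (k1 L)^2 and C2 = (k2 L)^2, evaluated once over the whole image instead of over local windows. The function name, the flat pixel lists and the sample values are assumptions of this example and do not reproduce the MATLAB experiments.

```python
def global_ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized flat pixel sequences."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    luminance_contrast = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    normaliser = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return luminance_contrast / normaliser


# Hypothetical 6-pixel "images" with 8-bit intensities.
original = [0, 50, 100, 150, 200, 250]
decoded = [10, 40, 120, 140, 210, 230]
same = global_ssim(original, original)
degraded = global_ssim(original, decoded)
```

For identical inputs the function returns 1, which is also the upper bound of this index; any distortion lowers the value.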


Table 2. Comparison of structure similarity for
Peacock, Panda and Church images

Images /   Existing                                      Proposed
Method     LBG       CUCKOO-LBG   PSO-GRVQ   HBMO-GRVQ   APOA-GRVQ
Peacock    0.9036    0.9114       0.9191     0.9212      0.9164
Panda      0.8618    0.9084       0.9368     0.9593      1.0414
Church     0.9057    0.9138       0.9213     0.9183      0.9241


Fig. 4. Comparison Result of Structure Similarity

Figure 4 compares the proposed APOA-GRVQ
technique with the existing LBG, Cuckoo-LBG, PSO-GRVQ
and HBMO-GRVQ methods in terms of structure similarity.
The X axis shows the compression methods and the Y axis
shows the structure similarity. According to Table 2,
the proposed APOA-GRVQ technique gives the highest
structure similarity for the Panda and Church images,
while for the Peacock image its value (0.9164) is
slightly below those of PSO-GRVQ (0.9191) and
HBMO-GRVQ (0.9212).

Peak Signal to Noise Ratio (PSNR)

The PSNR is a quality measurement between the
original and the compressed image. A higher PSNR value
indicates better quality of the decompressed image.

$$PSNR(dB) = 10 \log_{10}\left(\frac{R^2}{MSE}\right) \tag{9}$$

R is the maximum peak pixel value of the input
image, and MSE is the mean square error between the input
and decompressed images.
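
Equation (9) can be checked with a short Python sketch. It assumes 8-bit images, so the peak value R is 255, and represents each image as a flat list of pixel values. The function names and the sample pixels are illustrative assumptions.

```python
import math


def mse(original, decoded):
    """Mean square error between two equally sized pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)


def psnr(original, decoded, peak=255.0):
    """Equation (9): PSNR in dB, with `peak` playing the role of R."""
    err = mse(original, decoded)
    if err == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(peak ** 2 / err)


# Hypothetical example: every pixel is off by one grey level, so MSE = 1.
value = psnr([0, 0, 0, 0], [1, 1, 1, 1])
```

Because the MSE sits in the denominator, doubling the per-pixel error lowers the PSNR by about 6 dB.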
Table 3. Comparison of Peak Signal to Noise Ratio for
Peacock, Panda and Church images

Images /   Existing                                      Proposed
Method     LBG       CUCKOO-LBG   PSO-GRVQ   HBMO-GRVQ   APOA-GRVQ
Peacock    31.0356   31.8116      32.1223    32.2836     34.7935
Panda      29.1514   30.6551      33.2738    35.8325     37.4321
Church     31.3487   32.1164      32.3647    32.6504     35.1425


Fig. 5. Comparison Result of Peak Signal to Noise Ratio

Figure 5 compares the proposed APOA-GRVQ
technique with the existing LBG, Cuckoo-LBG, PSO-GRVQ
and HBMO-GRVQ methods in terms of peak signal to noise
ratio. The X axis shows the compression methods and the
Y axis shows the peak signal to noise ratio values. The
bar chart shows that the proposed APOA-GRVQ technique
gives the highest peak signal to noise ratio for the
Peacock, Panda and Church images.

Bit Rate (kb/s) 

The bit rate is the amount of data processed in a
given time. It is measured in bits per second, kilobits
per second, or megabits per second.


Table 4. Comparison of Bit Rate for Peacock, Panda
and Church images

Images /   Existing                                      Proposed
Method     LBG       CUCKOO-LBG   PSO-GRVQ   HBMO-GRVQ   APOA-GRVQ
Peacock    2.5105    5.2211       5.7399     7.7182      8.7124
Panda      5.2211    4.3229       5.1602     7.8156      8.4314
Church     2.4857    4.0764       5.0628     7.0609      8.0864


Fig. 6. Comparison Result of Bit Rate

Figure 6 compares the proposed APOA-GRVQ
technique with the existing LBG, Cuckoo-LBG, PSO-GRVQ
and HBMO-GRVQ methods in terms of bit rate. The X axis
shows the compression methods and the Y axis shows the
bit rate values. The bar chart shows that the proposed
APOA-GRVQ technique gives the highest bit rate for the
Peacock, Panda and Church images.

CONCLUSION 

The proposed APOA efficiently optimizes the
GRVQ. The APOA algorithm is easy to implement and
efficiently solves problems whose optimal solutions lie
in the feasible region. In APOA, each branch represents
an individual codebook, which provides a potential
solution. Furthermore, the photosynthesis operator
computes the energy, while the phototropism operator
guides the growing direction. The experimental results
show that the proposed algorithm increases the quality of
images relative to existing algorithms such as
HBMO-GRVQ, PSO-GRVQ, Cuckoo-LBG and LBG. The proposed
APOA-GRVQ algorithm provides better quantization
accuracy and computation accuracy in terms of the
following performance parameters: Compression Ratio,
Peak Signal to Noise Ratio (PSNR), Structural Similarity,
and Bit Rate.